Abstract

We consider the problem of distributed online learning with multiple players in multi-armed bandit
(MAB) models. Each player can pick among multiple arms. When a player picks an arm, it gets a reward.
We consider both an i.i.d. reward model and a Markovian reward model. In the i.i.d. model, each arm is
modelled as an i.i.d. process with an unknown distribution and an unknown mean. In the Markovian
model, each arm is modelled as a finite, irreducible, aperiodic and reversible Markov chain with an
unknown probability transition matrix and stationary distribution. The arms give different rewards to
different players. If two players pick the same arm, there is a "collision", and neither of them gets any
reward. There is no dedicated control channel for coordination or communication among the players.
Any other communication between the users is costly and adds to the regret. We propose an online
index-based distributed learning policy, called the dUCB4 algorithm, that trades off exploration vs.
exploitation in the right way and achieves expected regret that grows at most as near-$O(\log^2 T)$.
The motivation comes from opportunistic spectrum access by multiple secondary users in cognitive radio
networks, wherein they must pick among various wireless channels that look different to different users.
To the best of our knowledge, this is the first distributed learning algorithm for multi-player MABs.
I. Introduction

In [1], Lai and Robbins introduced the classical non-Bayesian multi-armed bandit model.
Such models capture the essence of the learning problem that players face in an unknown
environment, where the players must not only explore to learn but also exploit in choosing the
best arm. Specifically, suppose a player can choose between $N$ arms. Upon choosing an arm $i$,
it gets a reward from a distribution with density $f(x, \theta_i)$. Time is slotted, and players do not
know the distributions (nor any statistics about them). The problem is to find a learning policy
that minimizes the expected regret over some time horizon $T$. It was shown by Lai and Robbins
[1] that there exists an index-type policy that achieves expected regret that grows asymptotically
as $\log T$, and this is order-optimal, i.e., there exists no causal policy that can do better. This
was generalized by Anantharam, et al. [2] to the case of multiple plays, i.e., when the player
can pick multiple arms at the same time. In [3], Agrawal proposed a sample mean based index
policy which achieves $\log T$ regret asymptotically. Assuming that the rewards come from
a distribution of bounded support, Auer, et al. [4] proposed a much simpler sample mean based
index policy, called UCB1, which achieves $\log T$ regret uniformly over time, not only asymptotically.
Also, unlike the policy in [3], the index does not depend on the specific family of distributions
that the rewards come from.

In [5], Anantharam, et al. proposed a policy for the case where the arms are modelled as
Markovian, not i.i.d. The rewards are assumed to come from a finite, irreducible and aperiodic
Markov chain represented by a single-parameter probability transition matrix. The state of each
arm evolves according to an underlying transition probability matrix when the arm is played and
remains frozen when passive. Such problems are called rested Markovian bandit problems (where
rested refers to no state evolution until the arm is played). In [6], Tekin and Liu extended the
UCB1 policy to the case of rested Markovian bandit problems. If some non-trivial bounds on the
underlying Markov chains are known a priori, they showed that the policy achieves $\log T$ regret
uniformly over time. Also, if no information about the underlying Markov chains is available, the
policy can easily be modified to get a near-$O(\log T)$ regret asymptotically. The models in which
the state of an arm continues to evolve even when it is not played are called restless Markovian
bandit problems. Restless models are considerably more difficult than the rested models and
have been shown to be PSPACE-hard [7]. This is because the optimal policy is no longer
to "play the arm with the highest mean reward". [8] employs a weaker notion of regret (weak
regret) which compares the reward of a policy to that of a policy which always plays the
arm with the highest mean reward. They propose a policy which achieves $\log T$ (weak) regret
uniformly over time if certain bounds on the underlying Markov model are known a priori and
achieves a near-$O(\log T)$ (weak) regret asymptotically when no such knowledge is available. [9]
proposes another, simpler policy which achieves the same bounds for weak regret. [10] proposes
a policy based on a deterministic sequence of exploration and exploitation and achieves the same
bounds for weak regret. In [11], the authors consider the notion of strong regret and propose a
policy which achieves near-$\log T$ (strong) regret for some special cases of the restless model.

Recently, there has been increasing interest in multi-armed bandit models, partly because of
opportunistic spectrum access problems. Consider a user who must choose between $N$ wireless
channels. It knows nothing about the channel statistics, i.e., it has no idea of how good or
bad the channels are, or what rate it may expect to get from each channel. The rates could be
learnt by exploring various channels. Thus, these have been formulated as multi-armed bandit
problems, and index-type policies have been proposed for choosing spectrum channels. In many
scenarios, there are multiple users accessing the channels at the same time. Each of these users
must be matched to a different channel. These have been formulated as a combinatorial
multi-armed bandit problem [12], [13], and it was shown that an "index-matching" algorithm that
at each instant determines a matching by solving a sum-index maximization problem achieves
$O(\log T)$ regret uniformly over time, and this is indeed order-optimal.

In other settings, the users cannot coordinate, and the problem must be solved in a decentralized
manner. Thus, settings where all channels (arms) are identical for all users with i.i.d. rewards
have been considered, and index-type policies that can achieve coordination have been proposed
that get $O(\log T)$ regret uniformly over time [14], [15], [16], [10]. A similar result for the Markovian
reward model with weak regret has been shown by [10], assuming some non-trivial bounds on the
underlying Markov chains are known a priori. The regret scales only polynomially in the number
of users and channels. Surprisingly, the lack of coordination between the players asymptotically
imposes no additional cost or regret.

In this paper, we consider the decentralized multi-armed bandit problem with distinct arms
for each player. We consider both the i.i.d. reward model and the rested Markovian reward
model. All players together must discover the best arms to play as a team. However, since they
are all trying to learn at the same time, they may collide when two or more pick the same
arm. We propose an index-type policy, dUCB4, based on a variation of the UCB1 index. At its
heart is a distributed bipartite matching algorithm such as Bertsekas' auction algorithm [17].
This algorithm operates in rounds, and in each round prices for the various arms are determined
based on bid values. This imposes a communication (and computation) cost on the algorithm
that must be accounted for. Nevertheless, we show that when certain non-trivial bounds on the
model parameters are known a priori, the dUCB4 algorithm that we introduce achieves (at most)
near-$O(\log^2 T)$ growth non-asymptotically in expected regret. If no such information about the
model parameters is available, the dUCB4 algorithm still achieves (at most) near-$O(\log^2 T)$ regret
asymptotically. A lower bound, however, is not known at this point and is a work in progress.

The paper is organized as follows. In Section II we present the model and problem formulation.
In Sections III and IV we present some variations on the single player MAB with i.i.d. rewards and
Markovian rewards, respectively. In Section V we introduce the decentralized MAB problem with
i.i.d. rewards. We then extend the results to the decentralized case with Markovian rewards in
Section VI. In Section VII we present the distributed bipartite matching algorithm which is used
in our main algorithm for decentralized MAB. In Section VIII we present some simulation
results to numerically evaluate the performance of our algorithm.

II. Model and Problem Formulation 

A. Arms with i.i.d. rewards 

We consider an $N$-armed bandit with $M$ players. In a wireless cognitive radio setting [18],
each arm could correspond to a channel, and each player to a user who wants to use a channel.
Time is slotted, and at each instant each player picks an arm. There is no dedicated control
channel for coordination among the players. So, potentially more than one player can pick the
same arm at the same instant. We will regard that as a collision. Player $i$ playing arm $k$ at time $t$
yields an i.i.d. reward $S_{ik}(t)$ with univariate density function $f(s, \theta_{ik})$, where $\theta_{ik}$ is a parameter in
the set $\Theta_{ik}$. We will assume that the rewards are bounded, and without loss of generality lie in
$[0, 1]$. Let $\mu_{i,k}$ denote the mean of $S_{ik}(t)$ w.r.t. the pdf $f(s, \theta_{ik})$. We assume that the parameter
vector $\theta = (\theta_{ij}, 1 \le i \le M, 1 \le j \le N)$ is unknown to the players, i.e., the players have
no information about the mean, the distributions or any other statistics about the rewards from
various arms other than what they observe while playing. We also assume that each player can
only observe the rewards that they get. When there is a collision, we will assume that all players
that choose the arm on which there is a collision get zero reward. This could be relaxed so that
the players share the reward in some manner, though the results do not change appreciably.

Let $X_{ij}(t)$ be the reward that player $i$ gets from arm $j$ at time $t$. Thus, if player $i$ plays arm
$k$ at time $t$ (and there is no collision), $X_{ik}(t) = S_{ik}(t)$, and $X_{ij}(t) = 0$, $j \ne k$. Denote the action
of player $i$ at time $t$ by $a_i(t) \in \mathcal{A} := \{1, \dots, N\}$. Then, the history seen by player $i$ at time $t$ is
$\mathcal{H}_i(t) = \{(a_i(1), X_{i,a_i(1)}(1)), \dots, (a_i(t), X_{i,a_i(t)}(t))\}$ with $\mathcal{H}_i(0) = \emptyset$. A policy $\alpha_i = (\alpha_i(t))_{t=1}^{\infty}$
for player $i$ is a sequence of maps $\alpha_i(t) : \mathcal{H}_i(t) \to \mathcal{A}$ that specifies the arm to be played at time
$t$ given the history seen by the player. Let $\mathcal{P}(N)$ be the set of vectors such that

$$\mathcal{P}(N) := \{a = (a_1, \dots, a_M) : a_i \in \mathcal{A},\ a_i \ne a_j \text{ for } i \ne j\}.$$

The players have a team objective: namely, over a time horizon $T$, they want to maximize the
expected sum of rewards $\mathbb{E}\big[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,a_i(t)}(t)\big]$. If the parameters
$\mu_{i,j}$ are known, this could easily be achieved by picking a bipartite matching

$$k^{**} \in \arg\max_{k \in \mathcal{P}(N)} \sum_{i=1}^{M} \mu_{i,k_i}, \tag{1}$$

i.e., the optimal bipartite matching with expected reward from each match. Note that this may
not be unique. Since the expected rewards $\mu_{i,j}$ are unknown, the players must pick learning
policies that minimize the expected regret, defined for policies $\alpha = (\alpha_i, 1 \le i \le M)$ as

$$\mathcal{R}_{\alpha}(T) = T \sum_{i=1}^{M} \mu_{i,k_i^{**}} - \mathbb{E}_{\alpha}\Big[\sum_{t=1}^{T} \sum_{i=1}^{M} X_{i,\alpha_i(t)}(t)\Big]. \tag{2}$$

Our goal is to find a decentralized algorithm that the players can use such that together they minimize
the expected regret.
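
To make the benchmark in the regret definition concrete, the following is a minimal sketch (not part of the original paper) that finds an optimal collision-free assignment by brute force and evaluates the expected one-slot team reward of any joint action under the zero-reward collision rule. The function names and the example matrix `mu` are illustrative assumptions; brute-force enumeration is only practical for small $M$ and $N$.

```python
from itertools import permutations


def optimal_matching(mu):
    """Return (best_value, best_assignment) over all collision-free assignments.

    mu[i][j] is the (assumed known) mean reward of player i on arm j,
    with M = len(mu) players and N = len(mu[0]) arms, M <= N.
    """
    M, N = len(mu), len(mu[0])
    best_value, best_assignment = float("-inf"), None
    # permutations(range(N), M) enumerates every vector with one arm per
    # player and no arm repeated, i.e. the set of matchings.
    for assignment in permutations(range(N), M):
        value = sum(mu[i][assignment[i]] for i in range(M))
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment


def team_reward(mu, actions):
    """Expected one-slot team reward; players that collide earn zero."""
    counts = {}
    for arm in actions:
        counts[arm] = counts.get(arm, 0) + 1
    return sum(mu[i][arm] for i, arm in enumerate(actions) if counts[arm] == 1)


# Hypothetical example: 2 players, 3 arms.
mu = [[0.9, 0.5, 0.2],
      [0.8, 0.1, 0.3]]
best_value, best_assignment = optimal_matching(mu)
# Per-slot expected regret of the greedy joint action (both pick arm 0).
greedy_gap = best_value - team_reward(mu, (0, 0))
```

In this example both players prefer arm 0, so playing greedily causes a collision and earns nothing; the optimal matching sends player 0 to its second-best arm.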

B. Arms with Markovian rewards 

Here we follow the model formulation introduced in the previous subsection, with the
exception that the rewards are now considered Markovian. The reward that player $i$ gets from
arm $j$ (when there is no collision), $X_{ij}$, is modelled as an irreducible, aperiodic, reversible
Markov chain on a finite state space $\mathcal{X}^{i,j}$ and represented by a transition probability matrix
$P^{i,j} := [p^{i,j}_{x,x'} : x, x' \in \mathcal{X}^{i,j}]$. We assume that rewards are bounded and strictly positive, and
without loss of generality lie in $(0, 1]$. Let $\pi^{i,j} := (\pi^{i,j}_x, x \in \mathcal{X}^{i,j})$ be the stationary
distribution of the Markov chain $P^{i,j}$. The mean reward from arm $j$ for player $i$ is defined as
$\mu_{i,j} := \sum_{x \in \mathcal{X}^{i,j}} x\, \pi^{i,j}_x$. Note that the Markov chain represented by $P^{i,j}$ makes a state transition
only when player $i$ plays arm $j$. Otherwise it remains rested.
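
The rested property can be illustrated with a short simulation sketch. This is an illustration under assumptions, not code from the paper: the class name `RestedArm`, the two-state chain and its reward values are hypothetical. The state advances only inside `play`, so an arm that is not played stays frozen.

```python
import random


class RestedArm:
    """A rested Markovian arm: the chain moves only when the arm is played."""

    def __init__(self, rewards, P, start=0, seed=0):
        self.rewards = rewards    # rewards[x] is the reward of state x, in (0, 1]
        self.P = P                # P[x][y] is the transition probability x -> y
        self.current = start      # index of the current state
        self.rng = random.Random(seed)

    def play(self):
        """Make one state transition and return the reward of the new state."""
        row = self.P[self.current]
        self.current = self.rng.choices(range(len(row)), weights=row)[0]
        return self.rewards[self.current]


# Hypothetical two-state arm (any two-state chain is reversible).
arm = RestedArm(rewards=[0.2, 1.0], P=[[0.7, 0.3], [0.4, 0.6]])
state_before = arm.current
# ... other arms may be played here; this arm's state does not change ...
state_after_idle = arm.current
reward = arm.play()
```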

We note that although we use the 'big O' notation to emphasize the regret order, unless
otherwise noted the results are non-asymptotic.

III. Some variations on single player multi-armed bandit with i.i.d. rewards

We first present some variations on the single player non-Bayesian multi-armed bandit model. 
These will prove useful later for the multi-player problem though they should also be of 
independent interest. 

A. UCB1 with index recomputation every L slots

Consider the classical single player non-Bayesian $N$-armed bandit problem. At each time $t$, the
player picks a particular arm, say $j$, and gets a random reward $X_j(t)$. The rewards $X_j(t)$, $1 \le t \le T$,
are independent and identically distributed according to some unknown probability measure
with an unknown expectation $\mu_j$. Without loss of generality, assume that $\mu_1 > \mu_i > \mu_N$, for
$i = 2, \dots, N-1$. Let $n_j(t)$ denote the number of times arm $j$ has been played by time $t$. Denote
$\Delta_j := \mu_1 - \mu_j$, $\Delta_{\min} := \min_{j, j \ne 1} \Delta_j$ and $\Delta_{\max} := \max_j \Delta_j$. The regret for any policy $\alpha$ is

$$\mathcal{R}_{\alpha}(T) := \mu_1 T - \sum_{j=1}^{N} \mu_j \mathbb{E}_{\alpha}[n_j(T)]. \tag{3}$$
The UCB1 index [4] is defined as

$$g_j(t) := \bar{X}_j(t) + \sqrt{\frac{2 \log t}{n_j(t)}}, \tag{4}$$

where $\bar{X}_j(t)$ is the average reward obtained by playing arm $j$ by time $t$. It is defined as
$\bar{X}_j(t) = \sum_{m=1}^{t} r_j(m) / n_j(t)$, where $r_j(m)$ is the reward obtained from arm $j$ at time $m$. If arm $j$
is played at time $t$ then $r_j(t) = X_j(t)$, and otherwise $r_j(t) = 0$. Now, an index-based policy
called UCB1 [4] is to pick the arm that has the highest index at each instant. It can be shown
that this algorithm achieves regret that grows logarithmically in $T$ non-asymptotically.

An easy variation of the above algorithm which will be useful in our analysis of subsequent 
algorithms is the following. Suppose the index is re-computed only once every L slots. In that 
case, it is easy to establish the following. 

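Theorem 1. Under the UCB1 algorithm with recomputation of the index once every $L$ slots, the
expected regret by time $T$ is given by

$$\mathcal{R}_{UCB_1}(T) \le L \sum_{j>1} \frac{8 \log T}{\Delta_j} + L \Big(1 + \frac{\pi^2}{3}\Big) \sum_{j>1} \Delta_j. \tag{5}$$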

The proof follows [4], taking into account the fact that every time a suboptimal arm is
selected, it is played for the next $L$ time slots. We omit it due to space considerations.
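
The variation can be sketched in a few lines of simulation code. This is a minimal sketch under assumptions that are not in the paper: Bernoulli rewards, the function name `ucb1_every_L`, and the example means are all illustrative. The only change from standard UCB1 is that the arm choice is refreshed once every `L` slots and held fixed in between.

```python
import math
import random


def ucb1_every_L(means, T, L, seed=0):
    """Simulate UCB1 with the index recomputed once every L slots.

    means[j] is the success probability of (assumed Bernoulli) arm j.
    Returns the number of times each arm was played by time T.
    """
    rng = random.Random(seed)
    N = len(means)
    counts = [0] * N      # n_j(t)
    sums = [0.0] * N      # cumulative reward per arm

    def pull(j):
        reward = 1.0 if rng.random() < means[j] else 0.0
        counts[j] += 1
        sums[j] += reward

    for j in range(N):    # initialization: play each arm once
        pull(j)

    t = N
    current = 0
    slots_since_update = L    # forces a recomputation on the first slot
    while t < T:
        if slots_since_update >= L:
            # UCB1 index: sample mean plus sqrt(2 log t / n_j).
            current = max(
                range(N),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
            slots_since_update = 0
        pull(current)
        slots_since_update += 1
        t += 1
    return counts


means = [0.9, 0.2]
counts = ucb1_every_L(means, T=5000, L=5)
# Pseudo-regret: gap of each arm times the number of times it was played.
pseudo_regret = sum((max(means) - m) * c for m, c in zip(means, counts))
```

Each suboptimal selection now costs up to `L` slots instead of one, which is why the bound scales with $L$.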

B. UCB4 Algorithm when index computation is costly 

Often, learning algorithms pay a penalty or cost for computation. This is particularly the 
case when the algorithms must solve combinatorial optimization problems that are NP-hard. 
Such costs also arise in decentralized settings wherein algorithms pay a communication cost for 
coordination between the decentralized players. This is indeed the case, as we shall see later 
when we present an algorithm to solve the decentralized multi-armed bandit problem. Here, 
however, we will just consider an "abstract" communication or computation cost. The problem
we formulate below can be solved with better regret bounds than the ones we present. At this
time, though, we are unable to design algorithms that have better regret bounds and also help in
decentralization.

Consider a computation cost every time the index is recomputed. Let the cost be $C$ units. Let
$m(t)$ denote the number of times the index is computed by time $t$. Then, under policy $\alpha$ the
expected regret is now given by

$$\tilde{\mathcal{R}}_{\alpha}(T) := \mu_1 T - \sum_{j=1}^{N} \mu_j \mathbb{E}_{\alpha}[n_j(T)] + C\, \mathbb{E}_{\alpha}[m(T)]. \tag{6}$$

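It is easy to argue that the UCB1 algorithm will give a regret $\Omega(T)$ for this problem, since it
recomputes the index in every slot. We present an alternative algorithm, called the UCB4 algorithm,
that gives sub-linear regret. Define the UCB4 index

$$g_j(t) := \bar{X}_j(t) + \sqrt{\frac{3 \log t}{n_j(t)}}. \tag{7}$$

We define an arm $j^*(t)$ to be the best arm if $j^*(t) \in \arg\max_{1 \le j \le N} g_j(t)$.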



Algorithm 1: UCB4
Initialization: Select each arm $j$ once for $t \le N$. Update the UCB4 indices. Set $\eta = 1$.
while ($t \le T$) do
    if ($\eta = 2^p$ for some $p = 0, 1, 2, \dots$) then
        Update the index vector $g(t)$;
        Compute the best arm $j^*(t)$;
        if ($j^*(t) \ne j^*(t-1)$) then
            Reset $\eta = 1$;
        end if
    else
        $j^*(t) = j^*(t-1)$;
    end if
    Play arm $j^*(t)$;
    Increment counter $\eta = \eta + 1$; $t = t + 1$;
end while
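
The following is a minimal runnable sketch of the UCB4 loop as we read it from the pseudocode and the proof of Theorem 2: the index is recomputed only when the counter is a power of two, and the counter restarts whenever the best arm changes. The Bernoulli reward model, the function name `ucb4`, and the confidence radius $\sqrt{3 \log t / n_j(t)}$ (taken from the proof) are assumptions of this sketch rather than a definitive implementation.

```python
import math
import random


def ucb4(means, T, seed=0):
    """Simulate UCB4 on (assumed Bernoulli) arms with success rates `means`.

    Returns (counts, index_updates): plays per arm, and the number of
    times the index vector was recomputed.
    """
    rng = random.Random(seed)
    N = len(means)
    counts = [0] * N
    sums = [0.0] * N

    def pull(j):
        reward = 1.0 if rng.random() < means[j] else 0.0
        counts[j] += 1
        sums[j] += reward

    for j in range(N):    # initialization: each arm once for t <= N
        pull(j)

    t = N + 1
    eta = 1
    best = None
    index_updates = 0
    while t <= T:
        if (eta & (eta - 1)) == 0:    # eta is a power of two
            index_updates += 1
            new_best = max(
                range(N),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(3.0 * math.log(t) / counts[j]),
            )
            if new_best != best:
                eta = 1               # restart the doubling schedule
            best = new_best
        # Otherwise keep playing the previously computed best arm.
        pull(best)
        eta += 1
        t += 1
    return counts, index_updates


counts, index_updates = ucb4([0.9, 0.2], T=5000)
```

Because the gaps between recomputations double while the best arm stays the same, the number of index updates grows far more slowly than the number of plays.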



We will use the following concentration inequality.
Fact 1: Chernoff-Hoeffding inequality [19]

Let $X_1, \dots, X_t$ be random variables with a common range $[0, 1]$ such that $\mathbb{E}[X_t | X_1, \dots, X_{t-1}] = \mu$.
Let $S_t = \sum_{i=1}^{t} X_i$. Then for all $a \ge 0$,

$$P(S_t \ge t\mu + a) \le e^{-2a^2/t} \quad \text{and} \quad P(S_t \le t\mu - a) \le e^{-2a^2/t}. \tag{8}$$
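
As a worked instance of (8), assume the confidence radius $c_{t,s} = \sqrt{3 \log t / s}$ that the proof of Theorem 2 uses, and let $\bar{X}_s = S_s / s$ be the sample mean of $s$ rewards from one arm. Taking $a = s\, c_{t,s}$ gives:

```latex
P\big(\bar{X}_s \ge \mu + c_{t,s}\big)
  = P\big(S_s \ge s\mu + s\,c_{t,s}\big)
  \le \exp\!\Big(-\frac{2\,(s\,c_{t,s})^2}{s}\Big)
  = \exp\!\big(-2 s \cdot \tfrac{3 \log t}{s}\big)
  = \exp(-6 \log t)
  = t^{-6}.
```

The lower tail is bounded in the same way. The constant 3 inside the radius is what produces the exponent 6.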

Theorem 2. The expected regret for the single player multi-armed bandit problem with per
computation cost $C$ using the UCB4 algorithm is given by

$$\tilde{\mathcal{R}}_{UCB_4}(T) \le \big(\Delta_{\max} + C(1 + \log T)\big) \cdot \Big( \sum_{j>1} \frac{12 \log T}{\Delta_j^2} + 2N \Big).$$

Thus, $\tilde{\mathcal{R}}_{UCB_4}(T) = O(\log^2 T)$.



Proof: We prove this in two steps. First, we compute the expected number of times a 
suboptimal arm is played and then the expected number of times we recompute the index. 



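Consider any suboptimal arm $j > 1$. Denote $c_{t,s} = \sqrt{3 \log t / s}$ and the indicator function of
the event $A$ by $I\{A\}$. Let $\tau_{j,m}$ be the time at which the player makes the $m$th transition to the
arm $j$ from another arm and $\tau'_{j,m}$ be the time at which the player makes the $m$th transition from
the arm $j$ to another arm. Let $\tilde{\tau}'_{j,m} = \min\{\tau'_{j,m}, T\}$. Then,

$$\begin{aligned}
n_j(T) &\le 1 + \sum_{m=1}^{T} (\tilde{\tau}'_{j,m} - \tau_{j,m})\, I\{\text{Arm } j \text{ is picked at time } \tau_{j,m},\ \tau_{j,m} \le T\} \\
&= 1 + \sum_{m=1}^{T} (\tilde{\tau}'_{j,m} - \tau_{j,m})\, I\{g_j(\tau_{j,m} - 1) \ge g_1(\tau_{j,m} - 1),\ \tau_{j,m} \le T\} \\
&\le l + \sum_{m=1}^{T} (\tilde{\tau}'_{j,m} - \tau_{j,m})\, I\{g_j(\tau_{j,m} - 1) \ge g_1(\tau_{j,m} - 1),\ \tau_{j,m} \le T,\ n_j(\tau_{j,m} - 1) \ge l\} \\
&\overset{(a)}{\le} l + \sum_{m=1}^{T} \sum_{p=0}^{\infty} 2^p\, I\{g_j(\tau_{j,m} + 2^p - 2) \ge g_1(\tau_{j,m} + 2^p - 2),\ \tau_{j,m} + 2^p \le T,\ n_j(\tau_{j,m} - 1) \ge l\} \\
&\overset{(b)}{\le} l + \sum_{m=1}^{T} \sum_{p=0}^{\infty} 2^p\, I\{g_j(m + 2^p - 2) \ge g_1(m + 2^p - 2),\ m + 2^p \le T,\ n_j(m - 1) \ge l\} \\
&\le l + \sum_{m=1}^{T} \sum_{p \ge 0,\, m+2^p \le T} 2^p\, I\{\bar{X}_j(m + 2^p - 1) + c_{m+2^p-1,\, n_j(m+2^p-1)} \ge \bar{X}_1(m + 2^p - 1) + c_{m+2^p-1,\, n_1(m+2^p-1)},\ n_j(m - 1) \ge l\} \qquad (9) \\
&\le l + \sum_{m=1}^{T} \sum_{p \ge 0,\, m+2^p \le T} 2^p\, I\Big\{\max_{l \le s_j < m+2^p} \bar{X}_j(m + 2^p - 1) + c_{m+2^p-1,\, s_j} \ge \min_{1 \le s_1 < m+2^p} \bar{X}_1(m + 2^p - 1) + c_{m+2^p-1,\, s_1}\Big\} \\
&\le l + \sum_{m=1}^{\infty} \sum_{p \ge 0,\, m+2^p \le T} \sum_{s_1=1}^{m+2^p} \sum_{s_j=l}^{m+2^p} 2^p\, I\{\bar{X}_j(m + 2^p) + c_{m+2^p,\, s_j} \ge \bar{X}_1(m + 2^p) + c_{m+2^p,\, s_1}\}. \qquad (10)
\end{aligned}$$
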

In Algorithm 1 (UCB4), if an arm is picked for the $p$th time consecutively (without switching to any
other arm in between), it is played for the next $2^p$ slots. Inequality (a) uses this fact. In
inequality (b), we replace $\tau_{j,m}$ by $m$, which is clearly an upper bound. Now, observe that the
event $\{\bar{X}_j(m + 2^p) + c_{m+2^p,s_j} \ge \bar{X}_1(m + 2^p) + c_{m+2^p,s_1}\}$ implies at least one of the following
events,

$$A := \{\bar{X}_1(m + 2^p) \le \mu_1 - c_{m+2^p,s_1}\}, \quad B := \{\bar{X}_j(m + 2^p) \ge \mu_j + c_{m+2^p,s_j}\},$$
$$\text{or } C := \{\mu_1 < \mu_j + 2 c_{m+2^p,s_j}\}. \tag{11}$$

Now, using the Chernoff-Hoeffding bound, we get

$$P\big(\bar{X}_1(m + 2^p) \le \mu_1 - c_{m+2^p,s_1}\big) \le (m + 2^p)^{-6}, \quad P\big(\bar{X}_j(m + 2^p) \ge \mu_j + c_{m+2^p,s_j}\big) \le (m + 2^p)^{-6}.$$

For $l = \lceil 12 \log T / \Delta_j^2 \rceil$, the last event in (11) is false. In fact, $\mu_1 - \mu_j - 2 c_{m+2^p,s_j}
= \mu_1 - \mu_j - 2\sqrt{3 \log(m + 2^p) / s_j} \ge \mu_1 - \mu_j - \Delta_j = 0$, for $s_j \ge \lceil 12 \log T / \Delta_j^2 \rceil$.
So, we get

$$\begin{aligned}
\mathbb{E}[n_j(T)] &\le \Big\lceil \frac{12 \log T}{\Delta_j^2} \Big\rceil + \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} \sum_{s_1=1}^{m+2^p} \sum_{s_j=1}^{m+2^p} 2^p \cdot 2 (m + 2^p)^{-6} \\
&\le \Big\lceil \frac{12 \log T}{\Delta_j^2} \Big\rceil + 2 \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p (m + 2^p)^{-4} \le \frac{12 \log T}{\Delta_j^2} + 2. \qquad (12)
\end{aligned}$$

Next, we upper-bound the expectation of $m(T)$, the number of index computations performed
by time $T$. We can write $m(T) = m_1(T) + m_2(T)$, where $m_1(T)$ is the number of index updates
that result in an optimal allocation, and $m_2(T)$ is the number of index updates that result in a
suboptimal allocation. Clearly, the number of updates resulting in a suboptimal allocation is less
than the number of times a suboptimal arm is played. Thus,

$$\mathbb{E}[m_2(T)] \le \sum_{j>1}^{N} \mathbb{E}[n_j(T)]. \tag{13}$$

To bound $\mathbb{E}[m_1(T)]$, let $\tau_l$ be the time at which the player makes the $l$th transition to an optimal
arm from a suboptimal arm and $\tau'_l$ be the time at which the player makes the $l$th transition from
an optimal arm to a suboptimal arm. Then, $m_1(T) \le \sum_{l=1}^{n_{sub}(T)} \log |\tau_l - \tau'_l|$, where $n_{sub}(T)$ is
the total number of such transitions by time $T$. Clearly, $n_{sub}(T)$ is upper-bounded by the total
number of times the player picks a suboptimal arm. Also, $\log |\tau_l - \tau'_l| \le \log T$. So,

$$\mathbb{E}[m_1(T)] \le \sum_{j>1}^{N} \mathbb{E}[n_j(T)] \cdot \log T. \tag{14}$$



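Thus, from bounds (13) and (14), we get

$$\mathbb{E}[m(T)] \le \sum_{j>1}^{N} \mathbb{E}[n_j(T)] \cdot (1 + \log T). \tag{15}$$

Now, using equation (6), the expected regret is

$$\begin{aligned}
\tilde{\mathcal{R}}_{UCB_4}(T) &= \sum_{j>1}^{N} \mathbb{E}[n_j(T)] \cdot \Delta_j + C\, \mathbb{E}[m(T)] \le \Delta_{\max} \sum_{j>1}^{N} \mathbb{E}[n_j(T)] + C\, \mathbb{E}[m(T)] \\
&\le \big(\Delta_{\max} + C(1 + \log T)\big) \sum_{j>1}^{N} \mathbb{E}[n_j(T)],
\end{aligned}$$

by using (15). Now, by bound (12), we get the desired bound on the expected regret. ■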
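
The last step of bound (12) relies on the double series $2 \sum_{m \ge 1} \sum_{p \ge 0} 2^p (m + 2^p)^{-4}$ being small enough to absorb the ceiling. A quick numerical check, offered here only as a sanity sketch with arbitrarily chosen truncation limits, suggests it stays comfortably below 1:

```python
def tail_series(max_m=2000, max_p=40):
    """Truncated value of 2 * sum_{m>=1} sum_{p>=0} 2^p * (m + 2^p)^(-4).

    The truncation limits are illustrative; the neglected terms decay
    like m^(-4) in m and roughly like 4^(-p) in p.
    """
    total = 0.0
    for p in range(max_p):
        for m in range(1, max_m + 1):
            total += 2 ** p * (m + 2 ** p) ** -4
    return 2.0 * total


series_value = tail_series()
```

Since the series is below 1 and a ceiling adds at most 1, the two together are at most 2, matching the additive constant in (12).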
Remarks. 1. It is easy to show that the lower bound for the single player MAB problem with
computation costs is $\Omega(\log T)$. This can be achieved by the UCB2 algorithm [4]. To see this,
note that the number of times the player selects a suboptimal arm when using UCB2 is $O(\log T)$.
Since $\mathbb{E}[n_j(T)] = O(\log T)$, we get $\mathbb{E}[\sum_{j>1} n_j(T)] = O(\log T)$, and also $\mathbb{E}[m_2(T)] = O(\log T)$.
Now, since the epochs are not reset after every switch and are exponentially spaced, the
number of updates that result in the optimal allocation satisfies $m_1(T) \le \log T$. These together yield

$$\tilde{\mathcal{R}}_{UCB_2}(T) \le \sum_{j>1}^{N} \mathbb{E}[n_j(T)] \cdot \Delta_j + C\, \mathbb{E}[m(T)] = O(\log T).$$

2. Variations of the UCB2 algorithm that use a deterministic schedule can also be used [20]. But
it is unknown at this time if these can be used in solving the decentralized MAB problem that
we introduce in the next section. This is the main reason for introducing the UCB4 algorithm.

C. Algorithms with finite precision indices 

Often, there might be a cost to compute the indices to a particular precision. In that case,
indices may be known only up to some $\epsilon$ precision, and it may not be possible to tell which of two
indices is greater if they are within $\epsilon$ of each other. The question then is how the performance
of various index-based policies such as UCB1, UCB4, etc. is affected if there are limits on index
resolution, and only an arm with an $\epsilon$-highest index can be picked. We first show that if $\Delta_{\min}$
is known, we can fix a precision $0 < \epsilon < \Delta_{\min}$, so that the UCB4 algorithm will achieve order
log-squared regret growth with $T$. If $\Delta_{\min}$ is not known, we can pick a positive monotone sequence
$\{\epsilon_t\}$ such that $\epsilon_t \to 0$ as $t \to \infty$. Denote the cost of computation for $\epsilon$-precision by $C(\epsilon)$. We
assume that $C(\epsilon) \to \infty$ monotonically as $\epsilon \to 0$.
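
The notion of picking "an arm with an $\epsilon$-highest index" can be made concrete with a small sketch. The helper names below are hypothetical; the point is only that any arm whose index is within $\epsilon$ of the maximum is indistinguishable from the best one and may be returned.

```python
import random


def eps_best_arms(indices, eps):
    """Arms whose index is within eps of the largest index."""
    top = max(indices)
    return [j for j, g in enumerate(indices) if g >= top - eps]


def pick_eps_best(indices, eps, rng):
    """Model a finite-precision argmax: any eps-best arm may be picked."""
    return rng.choice(eps_best_arms(indices, eps))


indices = [0.50, 0.48, 0.30]
coarse = eps_best_arms(indices, eps=0.05)   # arms 0 and 1 cannot be told apart
fine = eps_best_arms(indices, eps=0.01)     # only arm 0 remains
choice = pick_eps_best(indices, 0.05, random.Random(0))
```

With $\epsilon$ smaller than the smallest gap $\Delta_{\min}$, this ambiguity cannot by itself make a suboptimal arm look best once the indices have concentrated, which is the intuition behind the first part of the theorem that follows.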

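Theorem 3. (i) If $\Delta_{\min}$ is known, choose an $0 < \epsilon < \Delta_{\min}$. Then, the expected regret of the
UCB4 algorithm with $\epsilon$-precise computations is given by

$$\tilde{\mathcal{R}}_{UCB_4}(T) \le \big(\Delta_{\max} + C(\epsilon)(1 + \log T)\big) \cdot \Big( \sum_{j>1} \frac{12 \log T}{(\Delta_j - \epsilon)^2} + 2N \Big).$$

Thus, $\tilde{\mathcal{R}}_{UCB_4}(T) = O(\log^2 T)$.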

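(ii) If $\Delta_{\min}$ is unknown, denote $\epsilon_{\min} = \Delta_{\min}/2$ and choose a positive monotone sequence $\{\epsilon_t\}$
such that $\epsilon_t \to 0$ as $t \to \infty$. Then, there exists a $t_0 > 0$ such that for all $T > t_0$,

$$\tilde{\mathcal{R}}_{UCB_4}(T) \le \big(\Delta_{\max} + C(\epsilon_{t_0})\big)\, t_0 + \big(\Delta_{\max} + C(\epsilon_T)(1 + \log T)\big) \cdot \Big( \sum_{j>1} \frac{12 \log T}{(\Delta_j - \epsilon_{\min})^2} + 2N \Big),$$

where $t_0$ is the smallest $t$ such that $\epsilon_{t_0} < \epsilon_{\min}$. Thus, by choosing an arbitrarily slowly decreasing
sequence $\{\epsilon_t\}$, we can make the regret arbitrarily close to $O(\log^2 T)$ asymptotically.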

Proof: (i) The proof is only a slight modification of the proof of Theorem 2. Due to
the $\epsilon$ precision, the player will pick a suboptimal arm if the event $\{\bar{X}_j(m + 2^p) + c_{m+2^p,s_j} + \epsilon \ge
\bar{X}_1(m + 2^p) + c_{m+2^p,s_1}\}$ occurs. Thus equation (9) becomes

$$n_j(T) \le l + \sum_{m=1}^{\infty} \sum_{p \ge 0,\, m+2^p \le T} \sum_{s_1=1}^{m+2^p} \sum_{s_j=l}^{m+2^p} 2^p\, I\{\bar{X}_j(m + 2^p) + c_{m+2^p,s_j} + \epsilon \ge \bar{X}_1(m + 2^p) + c_{m+2^p,s_1}\}.$$

Now, the event $\{\bar{X}_j(m + 2^p) + c_{m+2^p,s_j} + \epsilon \ge \bar{X}_1(m + 2^p) + c_{m+2^p,s_1}\}$ implies that at least one
of the following events must occur:

$$A := \{\bar{X}_1(m + 2^p) \le \mu_1 - c_{m+2^p,s_1}\}, \quad B := \{\bar{X}_j(m + 2^p) \ge \mu_j + \epsilon + c_{m+2^p,s_j}\},$$
$$C := \{\mu_1 < \mu_j + \epsilon + 2 c_{m+2^p,s_j}\}, \quad \text{or } D := \{\mu_1 < \mu_j + \epsilon\}. \tag{16}$$
Since $\{\bar{X}_j(m + 2^p) \ge \mu_j + \epsilon + c_{m+2^p,s_j}\} \subseteq \{\bar{X}_j(m + 2^p) \ge \mu_j + c_{m+2^p,s_j}\}$, we have

$$P\big(\{\bar{X}_j(m + 2^p) \ge \mu_j + \epsilon + c_{m+2^p,s_j}\}\big) \le P\big(\{\bar{X}_j(m + 2^p) \ge \mu_j + c_{m+2^p,s_j}\}\big).$$

Also, for $l = \lceil 12 \log T / (\Delta_j - \epsilon)^2 \rceil$, the event $C$ cannot happen. In fact, $\mu_1 - \mu_j - \epsilon - 2 c_{m+2^p,s_j}
= \mu_1 - \mu_j - \epsilon - 2\sqrt{3 \log(m + 2^p) / s_j} \ge \mu_1 - \mu_j - \epsilon - (\Delta_j - \epsilon) = 0$, for $s_j \ge \lceil 12 \log T / (\Delta_j - \epsilon)^2 \rceil$. If
$\epsilon < \Delta_{\min}$, the last event ($D$) in equation (16) is also not true. Thus, for $0 < \epsilon < \Delta_{\min}$, we get

$$\mathbb{E}[n_j(T)] \le \frac{12 \log T}{(\Delta_j - \epsilon)^2} + 2. \tag{17}$$

The rest of the proof is the same as in Theorem 2. Now, if $\Delta_{\min}$ is known, we can choose
$0 < \epsilon < \Delta_{\min}$, and by Theorem 2 and bound (17), we get the desired result.
(ii) If $\Delta_{\min}$ is unknown, we can choose a positive monotone sequence $\{\epsilon_t\}$ such that $\epsilon_t \to 0$ as
$t \to \infty$. Thus, there exists a $t_0$ such that for $t > t_0$, $\epsilon_t < \epsilon_{\min}$. We may get linear regret up to
time $t_0$, but after that the analysis follows that in the proof of Theorem 2 and the regret grows only
sub-linearly. Since $C(\cdot)$ is monotone, $C(\epsilon_T) \ge C(\epsilon_t)$ for all $t \le T$. The last part can now be
trivially established using the obtained bound on the expected regret. ■

IV. Single Player Multi-armed Bandit with Markovian Rewards 

Now, we consider the scenario where the rewards obtained from an arm are not i.i.d. but come
from a Markov chain. The reward from each arm is modelled as an irreducible, aperiodic, reversible
Markov chain on a finite state space $\mathcal{X}^i$ and represented by a transition probability matrix $P^i :=
[p^i_{x,x'} : x, x' \in \mathcal{X}^i]$. Assume that the reward space $\mathcal{X}^i \subseteq (0, 1]$. Let $X_i(1), X_i(2), \dots$ denote the
successive rewards from arm $i$. All arms are mutually independent. Let $\pi^i := (\pi^i_x, x \in \mathcal{X}^i)$ be
the stationary distribution of the Markov chain $P^i$. Since the Markov chains are ergodic under
these assumptions, the mean reward from arm $i$ is given by $\mu_i := \sum_{x \in \mathcal{X}^i} x\, \pi^i_x$. Without loss of
generality, assume that $\mu_1 > \mu_i > \mu_N$, for $i = 2, \dots, N-1$. As before, $n_j(t)$ denotes the number
of times arm $j$ has been played by time $t$. Denote $\Delta_j := \mu_1 - \mu_j$, $\Delta_{\min} := \min_{j, j \ne 1} \Delta_j$ and
$\Delta_{\max} := \max_j \Delta_j$. Denote $\pi_{\min} := \min_{1 \le i \le N, x \in \mathcal{X}^i} \pi^i_x$, $x_{\max} := \max_{1 \le i \le N, x \in \mathcal{X}^i} x$ and $x_{\min} :=
\min_{1 \le i \le N, x \in \mathcal{X}^i} x$. Let $\hat{\pi}^i_x := \max\{\pi^i_x, 1 - \pi^i_x\}$ and $\hat{\pi}_{\max} := \max_{1 \le i \le N, x \in \mathcal{X}^i} \hat{\pi}^i_x$. Let $|\mathcal{X}^i|$ denote
the cardinality of the state space $\mathcal{X}^i$, and $|\mathcal{X}|_{\max} := \max_{1 \le i \le N} |\mathcal{X}^i|$. Let $\rho^i$ be the eigenvalue gap,
$1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of the matrix $(P^i)^2$. Denote $\rho_{\max} := \max_{1 \le i \le N} \rho^i$
and $\rho_{\min} := \min_{1 \le i \le N} \rho^i$, where $\rho^i$ is the eigenvalue gap of the $i$th arm.
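
For a small chain these quantities are easy to compute directly. The sketch below is illustrative only: the function names and the two-state example are assumptions, the stationary distribution is found by plain power iteration, and the eigenvalue gap uses the closed form for a two-state chain, reading the definition above as the gap of the squared transition matrix.

```python
def stationary_distribution(P, iters=10_000, tol=1e-13):
    """Stationary distribution of an ergodic chain by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[x] * P[x][y] for x in range(n)) for y in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    return pi


def two_state_gap_of_square(P):
    """Eigenvalue gap 1 - lambda_2 of P squared, for a two-state chain.

    A two-state chain has eigenvalues 1 and 1 - P[0][1] - P[1][0], so the
    second eigenvalue of P squared is the square of the latter.
    """
    lam2 = 1.0 - P[0][1] - P[1][0]
    return 1.0 - lam2 ** 2


# Hypothetical two-state arm with rewards in (0, 1].
P = [[0.7, 0.3], [0.4, 0.6]]
rewards = [0.2, 1.0]

pi = stationary_distribution(P)
mu = sum(x * p for x, p in zip(rewards, pi))   # mean reward of the arm
pi_min = min(pi)
pi_hat_max = max(max(p, 1.0 - p) for p in pi)
rho = two_state_gap_of_square(P)
```

For larger state spaces a numerical eigenvalue routine would replace the closed form; the rest of the computation is unchanged.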

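The total reward obtained by time $T$ is then given by $S_T = \sum_{j=1}^{N} \sum_{s=1}^{n_j(T)} X_j(s)$. The
regret for any policy $\alpha$ is defined as

$$\tilde{\mathcal{R}}_{M,\alpha}(T) := \mu_1 T - \mathbb{E}_{\alpha}\Big[\sum_{j=1}^{N} \sum_{s=1}^{n_j(T)} X_j(s)\Big] + C\, \mathbb{E}_{\alpha}[m(T)] \tag{18}$$

where $C$ is the cost per computation and $m(T)$ is the number of times the index is computed
by time $T$, as described in Section III. Define the index

$$g_j(t) := \bar{X}_j(t) + \sqrt{\frac{\kappa \log t}{n_j(t)}}, \tag{19}$$

where $\bar{X}_j(t)$ is the average reward obtained by playing arm $j$ by time $t$, as defined in the previous
section. $\kappa$ can be any constant satisfying $\kappa > 168 |\mathcal{X}|^2_{\max} / \rho_{\min}$.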

We introduce one more piece of notation here. If $\mathcal{F}$ and $\mathcal{G}$ are two $\sigma$-algebras, then $\mathcal{F} \vee \mathcal{G}$ denotes
the smallest $\sigma$-algebra containing $\mathcal{F}$ and $\mathcal{G}$. Similarly, if $\{\mathcal{F}_t, t = 1, 2, \dots\}$ is a collection of
$\sigma$-algebras, then $\vee_{t \ge 1} \mathcal{F}_t$ denotes the smallest $\sigma$-algebra containing $\mathcal{F}_1, \mathcal{F}_2, \dots$

The following can be derived easily from Lemma 4 [5], reproduced in the appendix.

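Lemma 1. If the reward of each arm is given by a Markov chain satisfying the hypothesis of
Lemma 4, then under any policy $\alpha$ we have

$$\tilde{\mathcal{R}}_{M,\alpha}(T) \le \sum_{j=2}^{N} \Delta_j \mathbb{E}_{\alpha}[n_j(T)] + K_{\mathcal{X},P} + C\, \mathbb{E}_{\alpha}[m(T)] \tag{20}$$

where $K_{\mathcal{X},P} = \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x / \pi^j_{\min}$ and $\pi^j_{\min} = \min_{x \in \mathcal{X}^j} \pi^j_x$.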

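Proof: Let $X_j(1), X_j(2), \dots$ denote the successive rewards from arm $j$. Let $\mathcal{F}^j_t$ denote the
$\sigma$-algebra generated by $(X_j(1), \dots, X_j(t))$. Let $\mathcal{F}^j = \vee_{t \ge 1} \mathcal{F}^j_t$ and $\mathcal{G}^j = \vee_{i \ne j} \mathcal{F}^i$. Since the arms
are independent, $\mathcal{G}^j$ is independent of $\mathcal{F}^j$. Clearly, $n_j(T)$ is a stopping time with respect to
$\{\mathcal{G}^j \vee \mathcal{F}^j_t\}$. The total reward is $S_T = \sum_{j=1}^{N} \sum_{s=1}^{n_j(T)} X_j(s) = \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x\, N(x, n_j(T))$, where
$N(x, n_j(T)) := \sum_{t=1}^{n_j(T)} I\{X_j(t) = x\}$. Taking the expectation and using Lemma 4, we have

$$\Big| \mathbb{E}[S_T] - \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x\, \pi^j_x\, \mathbb{E}[n_j(T)] \Big| \le \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x / \pi^j_{\min},$$

which implies $\big| \mathbb{E}[S_T] - \sum_{j=1}^{N} \mu_j \mathbb{E}[n_j(T)] \big| \le K_{\mathcal{X},P}$, where $K_{\mathcal{X},P} = \sum_{j=1}^{N} \sum_{x \in \mathcal{X}^j} x / \pi^j_{\min}$. Since the regret is
$\tilde{\mathcal{R}}_{M,\alpha}(T) = \mu_1 T - \mathbb{E}_{\alpha}\big[\sum_{j=1}^{N} \sum_{s=1}^{n_j(T)} X_j(s)\big] + C\, \mathbb{E}_{\alpha}[m(T)]$ (cf. equation (18)), we get

$$\Big| \tilde{\mathcal{R}}_{M,\alpha}(T) - \Big( \mu_1 T - \sum_{j=1}^{N} \mu_j \mathbb{E}[n_j(T)] + C\, \mathbb{E}_{\alpha}[m(T)] \Big) \Big| \le K_{\mathcal{X},P}.$$

■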

We will use a concentration inequality for Markov chains (Lemma 5, from [21]), reproduced
in the appendix.

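Theorem 4. (i) If $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are known, choose $\kappa > 168 |\mathcal{X}|^2_{\max} / \rho_{\min}$. Then, the expected
regret using the UCB4 algorithm with the index defined as in (19) for the single player multi-armed
bandit problem with Markovian rewards and per computation cost $C$ is given by

$$\tilde{\mathcal{R}}_{M,UCB_4}(T) \le \big(\Delta_{\max} + C(1 + \log T)\big) \cdot \Big( \sum_{j>1} \frac{4 \kappa \log T}{\Delta_j^2} + N(2D + 1) \Big) + K_{\mathcal{X},P}$$

where $D = |\mathcal{X}|_{\max} / \pi_{\min}$. Thus, $\tilde{\mathcal{R}}_{M,UCB_4}(T) = O(\log^2 T)$.

(ii) If $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are not known, choose a positive monotone sequence $\{\kappa_t\}$ such that
$\kappa_t \to \infty$ as $t \to \infty$ and $\kappa_t \le t$. Then, $\tilde{\mathcal{R}}_{M,UCB_4}(T) = O(\kappa_T \log^2 T)$. Thus, by choosing an
arbitrarily slowly increasing sequence $\{\kappa_t\}$ we can make the regret arbitrarily close to $\log^2 T$.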



Proof: (i) Consider any suboptimal arm $j > 1$. Denote $c_{t,s} = \sqrt{\kappa \log t / s}$. As in the proof
of Theorem 2, we start by bounding $n_j(T)$. The initial steps are the same as in the proof of
Theorem 2. So, we skip those steps and start from the inequality (9) there:

$$n_j(T) \le l + \sum_{m=1}^{\infty} \sum_{p \ge 0,\, m+2^p \le T} \sum_{s_1=1}^{m+2^p} \sum_{s_j=l}^{m+2^p} 2^p\, I\{\bar{X}_j(m + 2^p) + c_{m+2^p,s_j} \ge \bar{X}_1(m + 2^p) + c_{m+2^p,s_1}\}.$$

The event $\{\bar{X}_j(m + 2^p) + c_{m+2^p,s_j} \ge \bar{X}_1(m + 2^p) + c_{m+2^p,s_1}\}$ is true only if at least one of the
events shown in display (11) is true. We note that, for any initial distribution $\lambda^j$ for arm $j$,
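$$N_{\lambda^j} = \Big\| \Big( \frac{\lambda^j_x}{\pi^j_x},\ x \in \mathcal{X}^j \Big) \Big\|_2 \le \sum_{x \in \mathcal{X}^j} \Big\| \frac{\lambda^j_x}{\pi^j_x} \Big\|_2 \le \frac{1}{\pi_{\min}}. \tag{21}$$

Also, $x_{\max} \le 1$. Let $n^j_x(s_j)$ be the number of times the state $x$ is observed when arm $j$ is pulled
$s_j$ times. Then, the probability of the event $B$ in (11) satisfies

$$\begin{aligned}
P\big(\bar{X}_j(m + 2^p) \ge \mu_j + c_{m+2^p,s_j}\big)
&= P\Big( \sum_{x \in \mathcal{X}^j} x\, n^j_x(s_j) \ge s_j \sum_{x \in \mathcal{X}^j} x\, \pi^j_x + s_j c_{m+2^p,s_j} \Big) \\
&\overset{(a)}{\le} \sum_{x \in \mathcal{X}^j} P\Big( n^j_x(s_j) - s_j \pi^j_x \ge \frac{s_j c_{m+2^p,s_j}}{x\, |\mathcal{X}^j|} \Big) \\
&= \sum_{x \in \mathcal{X}^j} P\Big( \frac{\sum_{t=1}^{s_j} I\{X_j(t) = x\} - s_j \pi^j_x}{\hat{\pi}^j_x s_j} \ge \frac{c_{m+2^p,s_j}}{x\, |\mathcal{X}^j|\, \hat{\pi}^j_x} \Big) \\
&\overset{(b)}{\le} \sum_{x \in \mathcal{X}^j} N_{\lambda^j} \exp\Big( -\frac{s_j\, c^2_{m+2^p,s_j}\, \rho^j}{28\, x^2\, |\mathcal{X}^j|^2\, (\hat{\pi}^j_x)^2} \Big) \\
&\overset{(c)}{\le} \frac{|\mathcal{X}|_{\max}}{\pi_{\min}}\, (m + 2^p)^{-\kappa \rho_{\min} / (28 |\mathcal{X}|^2_{\max})}.
\end{aligned}$$

The inequality (a) follows after some simple algebra, which we skip due to space limitations.
The inequality (b) follows by defining the function $f(X_j(t)) = (I\{X_j(t) = x\} - \pi^j_x) / \hat{\pi}^j_x$ and
using Lemma 5. For inequality (c) we used the facts that $N_{\lambda^j} \le 1/\pi_{\min}$, $x_{\max} \le 1$ and
$\hat{\pi}_{\max} \le 1$. Thus,

$$P\big(\bar{X}_j(m + 2^p) \ge \mu_j + c_{m+2^p,s_j}\big) \le D\, (m + 2^p)^{-\kappa \rho_{\min} / (28 |\mathcal{X}|^2_{\max})} \tag{22}$$

where $D = |\mathcal{X}|_{\max} / \pi_{\min}$. Similarly, we can get

$$P\big(\bar{X}_1(m + 2^p) \le \mu_1 - c_{m+2^p,s_1}\big) \le D\, (m + 2^p)^{-\kappa \rho_{\min} / (28 |\mathcal{X}|^2_{\max})}. \tag{23}$$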



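For $l = \lceil 4 \kappa \log T / \Delta_j^2 \rceil$, the last event in (11) is false. In fact, $\mu_1 - \mu_j - 2 c_{m+2^p,s_j} =
\mu_1 - \mu_j - 2\sqrt{\kappa \log(m + 2^p) / s_j} \ge \mu_1 - \mu_j - \Delta_j = 0$, for $s_j \ge \lceil 4 \kappa \log T / \Delta_j^2 \rceil$. Thus,

$$\begin{aligned}
\mathbb{E}[n_j(T)] &\le \Big\lceil \frac{4 \kappa \log T}{\Delta_j^2} \Big\rceil + \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} \sum_{s_1=1}^{m+2^p} \sum_{s_j=1}^{m+2^p} 2^p \cdot 2D\, (m + 2^p)^{-\kappa \rho_{\min} / (28 |\mathcal{X}|^2_{\max})} \\
&\le \Big\lceil \frac{4 \kappa \log T}{\Delta_j^2} \Big\rceil + 2D \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p\, (m + 2^p)^{-(\kappa \rho_{\min} - 56 |\mathcal{X}|^2_{\max}) / (28 |\mathcal{X}|^2_{\max})}. \qquad (24)
\end{aligned}$$

When $\kappa > 168 |\mathcal{X}|^2_{\max} / \rho_{\min}$, the exponent above is smaller than $-4$, so the summation converges to a
value less than 1 and we get

$$\mathbb{E}[n_j(T)] \le \frac{4 \kappa \log T}{\Delta_j^2} + (2D + 1). \tag{25}$$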



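Now, from the proof of Theorem 2 (equation (15)),

$$\mathbb{E}[m(T)] \le \sum_{j>1}^{N} \mathbb{E}[n_j(T)] \cdot (1 + \log T). \tag{26}$$

Now, using inequality (20), the expected regret satisfies

$$\begin{aligned}
\tilde{\mathcal{R}}_{M,UCB_4}(T) &\le \sum_{j>1}^{N} \mathbb{E}[n_j(T)] \cdot \Delta_j + C\, \mathbb{E}[m(T)] + K_{\mathcal{X},P} \le \Delta_{\max} \sum_{j>1}^{N} \mathbb{E}[n_j(T)] + C\, \mathbb{E}[m(T)] + K_{\mathcal{X},P} \\
&\le \big(\Delta_{\max} + C(1 + \log T)\big) \sum_{j>1}^{N} \mathbb{E}[n_j(T)] + K_{\mathcal{X},P},
\end{aligned}$$

by using (26). Now, by bound (25), we get the desired bound on the expected regret.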



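(ii) Replacing $\kappa$ with $\kappa_t$, equation (24) becomes

$$\mathbb{E}[n_j(T)] \le \Big\lceil \frac{4 \kappa_T \log T}{\Delta_j^2} \Big\rceil + 2D \sum_{m=1}^{\infty} \sum_{p=0}^{\infty} 2^p\, (m + 2^p)^{-(\kappa_{m+2^p} \rho_{\min} - 56 |\mathcal{X}|^2_{\max}) / (28 |\mathcal{X}|^2_{\max})}.$$

Since $\kappa_t \to \infty$ as $t \to \infty$, the exponent $-(\kappa_{m+2^p} \rho_{\min} - 56 |\mathcal{X}|^2_{\max}) / (28 |\mathcal{X}|^2_{\max})$ becomes smaller than $-4$ for
sufficiently large $m$ and $p$, and the above summation converges, yielding the desired result. ■

We note that we have used the results in [22] in the above proof. We also note that the results for
Markovian rewards just presented extend easily to finite precision indices. As before,
suppose the cost of computation for $\epsilon$-precision is $C(\epsilon)$. We assume that $C(\epsilon) \to \infty$ monotonically
as $\epsilon \to 0$. We formally state the following result, which we will use in Section VI.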



Theorem 5. (i) If $\Delta_{\min}$, $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are known, choose an $0 < \epsilon < \Delta_{\min}$ and a $\kappa >
168 |\mathcal{X}|^2_{\max} / \rho_{\min}$. Then, the expected regret using the UCB4 algorithm with the index defined as
in (19) for the single player multi-armed bandit problem with Markovian rewards with $\epsilon$-precise
computations is given by

$$\tilde{\mathcal{R}}_{M,UCB_4}(T) \le \big(\Delta_{\max} + C(\epsilon)(1 + \log T)\big) \cdot \Big( \sum_{j>1} \frac{4 \kappa \log T}{(\Delta_j - \epsilon)^2} + N(2D + 1) \Big),$$

where $D = |\mathcal{X}|_{\max} / \pi_{\min}$. Thus, $\tilde{\mathcal{R}}_{M,UCB_4}(T) = O(\log^2 T)$.

(ii) If $\Delta_{\min}$, $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are unknown, choose positive monotone sequences $\{\epsilon_t\}$
and $\{\kappa_t\}$ such that $\kappa_t \le t$, $\epsilon_t \to 0$ and $\kappa_t \to \infty$ as $t \to \infty$. Then, $\tilde{\mathcal{R}}_{M,UCB_4}(T) =
O(C(\epsilon_T)\, \kappa_T \log^2 T)$. We can choose $\{\epsilon_t\}$ to decrease and $\{\kappa_t\}$ to increase arbitrarily slowly,
and thus the regret can be made arbitrarily close to $\log^2 T$.

The proof follows by a combination of the proofs of Theorems 3 and 4, and is omitted.

V. The Decentralized MAB problem with i.i.d. rewards 

We now consider the decentralized multi-armed bandit problem with i.i.d. rewards wherein 
multiple players play at the same time. Players have no information about means or distribution 
of rewards from various arms. There are no dedicated control channels for coordination or 
communication between the players. If two or more players pick the same arm, we assume that 
neither gets any reward. This is an online learning problem of distributed bipartite matching.

Distributed algorithms for bipartite matching are known [23], [24], which determine an $\epsilon$-optimal matching with a 'minimum' amount of information exchange and computation. However, every run of such a distributed bipartite matching algorithm incurs a cost due to computation and to the communication necessary to exchange information for decentralization. Let $C$ be the cost per run, and $m(t)$ denote the number of times the distributed bipartite matching algorithm is run by time $t$. Then, under policy $\alpha$ the expected regret is



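$$\mathcal{R}_{\alpha}(T) = T\sum_{i=1}^{M}\mu_{i,k_i^{**}} - \mathbb{E}_{\alpha}\left[\sum_{t=1}^{T}\sum_{i=1}^{M} X_{i,\alpha_i(t)}(t)\right] + C\,\mathbb{E}[m(T)], \tag{27}$$

where $k^{**}$ is the optimal matching as defined in equation (1) in Section II-A.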



Temporal Structure. We divide time into frames. Each frame is one of two kinds: a decision frame, and an exploitation frame. In the decision frame, the index is recomputed, and the distributed bipartite matching algorithm is run again to determine the new matching. The length of such a frame can be seen as the cost of the algorithm. We further divide the decision frame into two phases, a negotiation phase and an interrupt phase (see Figure 1). The information exchange needed to compute an $\epsilon$-optimal matching is done in the negotiation phase. In the interrupt phase, a player signals to other players if his allocation has changed. In the exploitation frame, the current matching is exploited without updating the indices. Later, we will allow the frame lengths to increase with time.

We now present the dUCB4 algorithm, a decentralized version of UCB4. For each player $i$ and each arm $j$, we define a dUCB4 index at the end of frame $t$ as

$$g_{i,j}(t) := \bar{X}_{i,j}(t) + \sqrt{\frac{(M+2)\log n_i(t)}{n_{i,j}(t)}}, \tag{28}$$

where $n_i(t)$ is the number of successful plays (without collisions) of player $i$ by frame $t$, $n_{i,j}(t)$ is the number of times player $i$ picks arm $j$ successfully by frame $t$, and $\bar{X}_{i,j}(t)$ is the sample mean of rewards from arm $j$ for player $i$ from $n_{i,j}(t)$ samples. Let $g(t)$ denote the vector $(g_{i,j}(t), 1 \le i \le M, 1 \le j \le N)$. Note that $g$ is computed only in the decision frames using the information available up to that time. Each player now uses the dUCB4 algorithm. We will refer to an $\epsilon$-optimal distributed bipartite matching algorithm as $\mathrm{dBM}_{\epsilon}(g(t))$ that yields a solution $k^*(t) := (k_1^*(t), \ldots, k_M^*(t)) \in \mathcal{P}(N)$ such that $\sum_{i=1}^{M} g_{i,k_i^*(t)}(t) \ge \sum_{i=1}^{M} g_{i,k_i}(t) - \epsilon$, $\forall k \in \mathcal{P}(N)$, $k \ne k^*$. Let $k^{**} \in \mathcal{P}(N)$ be such that $k^{**} \in \arg\max_{k \in \mathcal{P}(N)} \sum_{i=1}^{M}\mu_{i,k_i}$, i.e., an optimal bipartite matching with expected rewards from each matching. Denote $\mu^{**} := \sum_{i=1}^{M}\mu_{i,k_i^{**}}$ and define $\Delta_k := \mu^{**} - \sum_{i=1}^{M}\mu_{i,k_i}$, $k \in \mathcal{P}(N)$. Let $\Delta_{\min} = \min_{k \in \mathcal{P}(N), k \ne k^{**}} \Delta_k$ and $\Delta_{\max} = \max_{k \in \mathcal{P}(N)} \Delta_k$. We assume that $\Delta_{\min} > 0$.
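The index computation for one player-arm pair can be sketched as follows. This is a minimal illustration, assuming the index is the sample mean plus an exploration term $\sqrt{(M+2)\log n_i(t)/n_{i,j}(t)}$, the form suggested by the $(M+2)$ constants appearing in the proof of Theorem 6; the function and argument names are hypothetical.

```python
import math


def ducb4_index(sample_mean, n_i, n_ij, num_players):
    """Sample mean plus exploration bonus for one (player, arm) pair.

    sample_mean : average reward seen by player i on arm j
    n_i         : successful plays of player i so far (all arms)
    n_ij        : successful plays of player i on arm j
    num_players : M, the number of players
    Assumed form: mean + sqrt((M + 2) * log(n_i) / n_ij).
    """
    if n_i < 1 or n_ij < 1:
        raise ValueError("each arm must be played at least once first")
    bonus = math.sqrt((num_players + 2) * math.log(n_i) / n_ij)
    return sample_mean + bonus


# A rarely played arm receives a larger bonus than a frequently played one.
rare = ducb4_index(0.5, n_i=100, n_ij=10, num_players=2)
frequent = ducb4_index(0.5, n_i=100, n_ij=40, num_players=2)
```

The initialization step of the algorithm (every player plays every arm once) is what guarantees `n_ij >= 1` before the index is first evaluated.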

In the dUCB4 algorithm, at the end of every decision frame, $\mathrm{dBM}_{\epsilon}(g(t))$ will give a legitimate matching with no two players colliding on any arm. Thus, the regret accrues either if the matching $k(t)$ is not the optimal matching $k^{**}$, or if a decision frame is employed by the players to recompute the matching. Every time a frame is a decision frame, it adds a cost $C$ to the regret. The cost $C$ depends on two parameters: (a) the precision of the bipartite matching algorithm $\epsilon_1 > 0$, and (b) the precision of the index representation $\epsilon_2 > 0$. A bipartite matching algorithm has an $\epsilon_1$-precision if it gives an $\epsilon_1$-optimal matching. This would happen, for example, when
Algorithm 2 dUCB4 for User i

Initialization: Play a set of matchings so that each player plays each arm at least once. Set counter $\eta = 1$.
while ($t \le T$) do
    if ($\eta = 2^p$ for some $p = 0, 1, 2, \ldots$) then
        // Decision frame:
        Update $g(t)$;
        Participate in the $\mathrm{dBM}_{\epsilon}(g(t))$ algorithm to obtain a match $k_i^*(t)$;
        if ($k_i^*(t) \ne k_i^*(t-1)$) then
            Use interrupt phase to signal an INTERRUPT to all other players about changed allocation;
            Reset $\eta = 1$;
        end if
        if (Received an INTERRUPT) then
            Reset $\eta = 1$;
        end if
    else
        // Exploitation frame:
        $k_i^*(t) = k_i^*(t-1)$;
    end if
    Play arm $k_i^*(t)$;
    Increment counter $\eta = \eta + 1$, $t = t + 1$;
end while



such an algorithm is run only for a finite number of rounds. The index has an $\epsilon_2$-precision if any two indices are not distinguishable if they are closer than $\epsilon_2$. This can happen, for example, when indices must be communicated to other players in a finite number of bits.

Thus, the cost $C$ is a function of $\epsilon_1$ and $\epsilon_2$, and can be denoted as $C(\epsilon_1, \epsilon_2)$, with $C(\epsilon_1, \epsilon_2) \to \infty$ as $\epsilon_1$ or $\epsilon_2 \to 0$. Since $\epsilon_1$ and $\epsilon_2$ are parameters that are fixed a priori, we consider $\epsilon = \min(\epsilon_1, \epsilon_2)$ to specify both precisions. We denote the cost as $C(\epsilon)$.
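The counter logic of Algorithm 2 determines which frames are decision frames. The sketch below is an illustration under simplifying assumptions: it tracks only the counter $\eta$ (not indices or matchings), and the hypothetical argument `changed_at` lists the frames at which the recomputed matching differs from the previous one, which triggers an INTERRUPT and a counter reset.

```python
def decision_frames(horizon, changed_at=frozenset()):
    """Return the frames (1-based) that are decision frames.

    A frame is a decision frame when the counter eta is a power of two.
    If the matching changes in a decision frame, eta is reset to 1 before
    the usual increment. Frames listed in changed_at that are not decision
    frames are ignored, since the matching is only recomputed there.
    """
    eta, frames = 1, []
    for t in range(1, horizon + 1):
        if eta & (eta - 1) == 0:      # eta is a power of two
            frames.append(t)
            if t in changed_at:       # allocation changed: INTERRUPT
                eta = 1
        eta += 1
    return frames


print(decision_frames(20))            # [1, 2, 4, 8, 16]
print(decision_frames(20, {4}))       # [1, 2, 4, 5, 7, 11, 19]
```

Without interrupts the decision frames are exponentially spaced, so only about $1 + \log_2 T$ of them occur by time $T$; each change of matching restarts this spacing, which is the intuition behind bounds of the form $\mathbb{E}[m(T)] \le \mathbb{E}[n(T)](1 + \log T)$.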

We first show that if $\Delta_{\min}$ is known, we can choose an $\epsilon < \Delta_{\min}/(M+1)$, so that the dUCB4 algorithm will achieve order log-squared regret growth with $T$. If $\Delta_{\min}$ is not known, we can pick a positive monotone sequence $\{\epsilon_t\}$ such that $\epsilon_t \to 0$ as $t \to \infty$. In a decentralized bipartite matching algorithm, the precision $\epsilon$ will depend on the amount of information exchanged in the decision frames. It is thus some monotonically decreasing function $\epsilon = f(L)$ of their length $L$ such that $\epsilon \to 0$ as $L \to \infty$. Thus, we must pick a positive monotone sequence $\{L_t\}$ such that $L_t \to \infty$. Clearly, $C(f(L_t)) \to \infty$ as $t \to \infty$. This can happen arbitrarily slowly.

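Theorem 6. (i) Let $\epsilon > 0$ be the precision of the bipartite matching algorithm and the precision of the index representation. If $\Delta_{\min}$ is known, choose $\epsilon > 0$ such that $\epsilon < \Delta_{\min}/(M+1)$. Let $L$ be the length of a frame. Then, the expected regret of the dUCB4 algorithm is

$$\mathcal{R}_{\mathrm{dUCB4}}(T) \le \big(L\Delta_{\max} + C(f(L))(1+\log T)\big)\cdot\left(\frac{4M^3(M+2)N\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + NM(2M+1)\right).$$

Thus, $\mathcal{R}_{\mathrm{dUCB4}}(T) = O(\log^2 T)$.

(ii) When $\Delta_{\min}$ is unknown, denote $\epsilon_{\min} = \Delta_{\min}/(2(M+1))$ and let $L_t \to \infty$ as $t \to \infty$. Then, there exists a $t_0 > 0$ such that for all $T > t_0$,

$$\mathcal{R}_{\mathrm{dUCB4}}(T) \le \big(L_{t_0}\Delta_{\max} + C(f(L_{t_0}))\big)t_0 + \big(L_T\Delta_{\max} + C(f(L_T))(1+\log T)\big)\cdot\left(\frac{4M^3(M+2)N\log T}{(\Delta_{\min} - (M+1)\epsilon_{\min})^2} + NM(2M+1)\right),$$

where $t_0$ is the smallest $t$ such that $f(L_{t_0}) < \epsilon_{\min}$. Thus by choosing an arbitrarily slowly increasing sequence $\{L_t\}$ we can make the regret arbitrarily close to $\log^2 T$.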

Proof: (i) First, we obtain a bound for $L = 1$. Then, we appeal to a result like Theorem 1 to obtain the result for general $L$. The implicit dependence between $\epsilon$ and $L$ through the function $f(\cdot)$ does not affect this part of the analysis. Details are omitted due to space limitations.

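We first upper bound the number of sub-optimal plays. We define $\tilde{n}_{i,j}(t)$, $1 \le i \le M$, $1 \le j \le N$ as follows: whenever the $\mathrm{dBM}_{\epsilon}(g(t))$ algorithm gives a non-optimal matching $k(t)$, $\tilde{n}_{i,j}(t)$ is increased by one for some $(i,j) \in \arg\min_{1 \le i \le M, 1 \le j \le N} n_{i,j}(t)$. Let $\tilde{n}(T)$ denote the total number of suboptimal plays. Then, clearly, $\tilde{n}(T) = \sum_{i=1}^{M}\sum_{j=1}^{N}\tilde{n}_{i,j}(T)$. So, in order to get a bound on $\tilde{n}(T)$ we first get a bound on $\tilde{n}_{i,j}(T)$.

Let $\tilde{I}_{i,j}(t)$ be the indicator function which is equal to 1 if $\tilde{n}_{i,j}(t)$ is incremented by one at time $t$. When $\tilde{I}_{i,j}(t) = 1$, there will be a corresponding matching $k(t) \ne k^{**}$ such that $k_i(t) = j$. In the following, we denote it as $k$, omitting the time index. A non-optimal matching $k$ is selected if the event $\big\{\sum_{i=1}^{M} g_{i,k_i^{**}}(m+2^p-1) \le (M+1)\epsilon + \sum_{i=1}^{M} g_{i,k_i}(m+2^p-1)\big\}$ happens. If each index has an error of at most $\epsilon$, the sum of $M$ terms may introduce an error of at most $M\epsilon$. In addition, the distributed bipartite matching algorithm $\mathrm{dBM}_{\epsilon}$ itself yields only an $\epsilon$-optimal matching. This accounts for the term $(M+1)\epsilon$ above. Since the initial steps are similar to those in Theorem 2, we skip those steps. Thus, we get

$$\begin{aligned}
\tilde{n}_{i,j}(T) &\le l + \sum_{m=1}^{T}\sum_{p=0}^{\infty} 2^p\, I\Big\{\sum_{i=1}^{M} g_{i,k_i^{**}}(m+2^p-1) \le (M+1)\epsilon + \sum_{i=1}^{M} g_{i,k_i}(m+2^p-1),\ \tilde{n}_{i,j}(m-1) \ge l\Big\} \\
&\le l + \sum_{m=1}^{T}\sum_{p=0}^{\infty} 2^p\, I\Big\{\sum_{i=1}^{M}\Big(\bar{X}_{i,k_i^{**}}(m+2^p-1) + c_{m+2^p-1,\,n_{i,k_i^{**}}(m+2^p-1)}\Big) \\
&\qquad\qquad \le (M+1)\epsilon + \sum_{i=1}^{M}\Big(\bar{X}_{i,k_i}(m+2^p-1) + c_{m+2^p-1,\,n_{i,k_i}(m+2^p-1)}\Big),\ \tilde{n}_{i,j}(m-1) \ge l\Big\} \\
&\le l + \sum_{m=1}^{T}\sum_{p=0}^{\infty} 2^p\, I\Big\{\min_{1 \le s_{1,k_1^{**}},\ldots,s_{M,k_M^{**}} < m+2^p}\ \sum_{i=1}^{M}\Big(\bar{X}_{i,k_i^{**}}(m+2^p-1) + c_{m+2^p-1,\,s_{i,k_i^{**}}}\Big) \\
&\qquad\qquad \le (M+1)\epsilon + \max_{l \le s'_{1,k_1},\ldots,s'_{M,k_M} < m+2^p}\ \sum_{i=1}^{M}\Big(\bar{X}_{i,k_i}(m+2^p-1) + c_{m+2^p-1,\,s'_{i,k_i}}\Big)\Big\} \\
&\le l + \sum_{m=1}^{\infty}\sum_{p=0}^{\infty} 2^p \sum_{s_{1,k_1^{**}}=1}^{m+2^p}\cdots\sum_{s_{M,k_M^{**}}=1}^{m+2^p}\ \sum_{s'_{1,k_1}=l}^{m+2^p}\cdots\sum_{s'_{M,k_M}=l}^{m+2^p} I\Big\{\sum_{i=1}^{M}\Big(\bar{X}_{i,k_i^{**}}(m+2^p) + c_{m+2^p,\,s_{i,k_i^{**}}}\Big) \\
&\qquad\qquad \le (M+1)\epsilon + \sum_{i=1}^{M}\Big(\bar{X}_{i,k_i}(m+2^p) + c_{m+2^p,\,s'_{i,k_i}}\Big)\Big\}.
\end{aligned} \tag{29}$$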

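Now, it is easy to observe that the event

$$\Big\{\sum_{i=1}^{M}\Big(\bar{X}_{i,k_i^{**}}(m+2^p) + c_{m+2^p,\,s_{i,k_i^{**}}}\Big) \le (M+1)\epsilon + \sum_{i=1}^{M}\Big(\bar{X}_{i,k_i}(m+2^p) + c_{m+2^p,\,s'_{i,k_i}}\Big)\Big\}$$

implies at least one of the following events:

$$\begin{aligned}
A_i &:= \Big\{\bar{X}_{i,k_i^{**}}(m+2^p) \le \mu_{i,k_i^{**}} - c_{m+2^p,\,s_{i,k_i^{**}}}\Big\}, \quad 1 \le i \le M, \\
B_i &:= \Big\{\bar{X}_{i,k_i}(m+2^p) \ge \mu_{i,k_i} + c_{m+2^p,\,s'_{i,k_i}}\Big\}, \quad 1 \le i \le M, \\
C &:= \Big\{\sum_{i=1}^{M}\mu_{i,k_i^{**}} < (M+1)\epsilon + \sum_{i=1}^{M}\mu_{i,k_i} + 2\sum_{i=1}^{M} c_{m+2^p,\,s'_{i,k_i}}\Big\}, \\
D &:= \Big\{\sum_{i=1}^{M}\mu_{i,k_i^{**}} < (M+1)\epsilon + \sum_{i=1}^{M}\mu_{i,k_i}\Big\}.
\end{aligned} \tag{30}$$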



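Using the Chernoff-Hoeffding inequality, we get $P(A_i) \le (m+2^p)^{-2(M+2)}$, $P(B_i) \le (m+2^p)^{-2(M+2)}$, $1 \le i \le M$. For $l \ge \left\lceil \frac{4M^2(M+2)\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil$, we get

$$\begin{aligned}
\sum_{i=1}^{M}\mu_{i,k_i^{**}} &- \sum_{i=1}^{M}\mu_{i,k_i} - (M+1)\epsilon - 2\sum_{i=1}^{M} c_{m+2^p,\,s'_{i,k_i}} \\
&\ge \sum_{i=1}^{M}\mu_{i,k_i^{**}} - \sum_{i=1}^{M}\mu_{i,k_i} - (M+1)\epsilon - 2M\sqrt{\frac{(M+2)\log(m+2^p)}{l}} \\
&\ge \sum_{i=1}^{M}\mu_{i,k_i^{**}} - \sum_{i=1}^{M}\mu_{i,k_i} - \Delta_{\min} \ge 0,
\end{aligned} \tag{31}$$

so the event $C$ is false. The event $D$ is false by assumption. So, we get,

$$\begin{aligned}
\mathbb{E}[\tilde{n}_{i,j}(T)] &\le \left\lceil \frac{4M^2(M+2)\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil + \sum_{m=1}^{\infty}\sum_{p=0}^{\infty} 2^p \sum_{s_{1,k_1^{**}}=1}^{m+2^p}\cdots\sum_{s_{M,k_M^{**}}=1}^{m+2^p}\ \sum_{s'_{1,k_1}=l}^{m+2^p}\cdots\sum_{s'_{M,k_M}=l}^{m+2^p} 2M\,(m+2^p)^{-2(M+2)} \\
&\le \left\lceil \frac{4M^2(M+2)\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil + 2M\sum_{m=1}^{\infty}\sum_{p=0}^{\infty} 2^p\,(m+2^p)^{-4} \\
&\le \frac{4M^2(M+2)\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2M+1).
\end{aligned} \tag{32}$$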



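Now, putting it all together, we get

$$\mathbb{E}[\tilde{n}(T)] = \sum_{i=1}^{M}\sum_{j=1}^{N}\mathbb{E}[\tilde{n}_{i,j}(T)] \le \frac{4M^3(M+2)N\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2M+1)MN.$$

Now, by the proof of Theorem 2 (c.f. equation (15)), $\mathbb{E}[m(T)] \le \mathbb{E}[\tilde{n}(T)](1+\log T)$. We can now bound the regret,

$$\begin{aligned}
\mathcal{R}_{\mathrm{dUCB4}}(T) &= \sum_{k \in \mathcal{P}(N), k \ne k^{**}} \Delta_k \sum_{i=1}^{M}\mathbb{E}[\tilde{n}_{i,k_i}(T)] + C\,\mathbb{E}[m(T)] \\
&\le \Delta_{\max}\sum_{k \in \mathcal{P}(N), k \ne k^{**}}\ \sum_{i=1}^{M}\mathbb{E}[\tilde{n}_{i,k_i}(T)] + C\,\mathbb{E}[m(T)] \\
&= \Delta_{\max}\,\mathbb{E}[\tilde{n}(T)] + C\,\mathbb{E}[m(T)].
\end{aligned}$$

For a general $L$, by Theorem 1 we get

$$\mathcal{R}_{\mathrm{dUCB4}}(T) \le L\Delta_{\max}\,\mathbb{E}[\tilde{n}(T)] + C(f(L))\,\mathbb{E}[m(T)] \le \big(L\Delta_{\max} + C(f(L))(1+\log T)\big)\,\mathbb{E}[\tilde{n}(T)].$$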



Now, using the bound (33), we get the desired upper bound on the expected regret.

(ii) Since $\epsilon_t = f(L_t)$ is a monotonically decreasing function of $L_t$ such that $\epsilon_t \to 0$ as $L_t \to \infty$, there exists a $t_0$ such that for $t > t_0$, $\epsilon_t < \epsilon_{\min}$. We may get a linear regret up to time $t_0$, but after that, by the analysis of Theorem 2, regret grows only sub-linearly. Since $C(\cdot)$ is monotonically increasing, $C(f(L_T)) \ge C(f(L_t))$, $\forall t \le T$, and we get the desired result. The last part is illustrative and can be trivially established using the obtained bound on the regret in (ii). ■
Remarks. 1. We note that in the initial steps, our proof followed the proof of the main result 
in ttia. 

2. The UCB2 algorithm described in flU performs computations only at exponentially spaced time 
epochs. So, it is natural to imagine that a decentralized algorithm based on it could be developed, 
and get a better regret bound. Unfortunately, the single player UCB2 algorithm has an obvious 
weakness: regret is linear in the number of arms. Thus, the decentralized/combinatorial extension 
of UCB2 would yield regret growing exponentially in the number of players and arms. We use a 
similar index but a different scheme, allowing us to achieve poly-log regret growth and a linear 
memory requirement for each player. 

VI. The Decentralized MAB problem with Markovian rewards

Now, we consider the decentralized MAB problem with $M$ players and $N$ arms where the rewards obtained each time an arm is pulled are not i.i.d. but come from a Markov chain. The reward that player $i$ gets from arm $j$ (when there is no collision), $X_{i,j}$, is modelled as an irreducible, aperiodic, reversible Markov chain on a finite state space $\mathcal{X}^{i,j}$ and represented by a transition probability matrix $P^{i,j} := (p^{i,j}_{x,x'} : x, x' \in \mathcal{X}^{i,j})$. Assume that $\mathcal{X}^{i,j} \subset (0, 1]$. Let $X_{i,j}(1), X_{i,j}(2), \ldots$ denote the successive rewards from arm $j$ for player $i$. All arms are mutually independent for all players. Let $\pi^{i,j} := (\pi^{i,j}_x, x \in \mathcal{X}^{i,j})$ be the stationary distribution of the Markov chain $P^{i,j}$. The mean reward from arm $j$ for player $i$ is defined as $\mu_{i,j} := \sum_{x \in \mathcal{X}^{i,j}} x\,\pi^{i,j}_x$. Note that the Markov chain represented by $P^{i,j}$ makes a state transition only when player $i$ plays arm $j$. Otherwise, it remains rested. As described in the previous section, $n_i(t)$ is the number of successful plays (without collisions) of player $i$ by frame $t$, $n_{i,j}(t)$ is the number of times player $i$ picks arm $j$ successfully by frame $t$ and $\bar{X}_{i,j}(t)$ is the sample mean of rewards from arm $j$ for player $i$ from $n_{i,j}(t)$ samples. Denote $\Delta_{\min} := \min_{k \in \mathcal{P}(N), k \ne k^{**}} \Delta_k$ and $\Delta_{\max} := \max_{k \in \mathcal{P}(N)} \Delta_k$.

Denote $\pi_{\min} := \min_{1 \le i \le M, 1 \le j \le N, x \in \mathcal{X}^{i,j}} \pi^{i,j}_x$, $x_{\max} := \max_{1 \le i \le M, 1 \le j \le N, x \in \mathcal{X}^{i,j}} x$ and $x_{\min} := \min_{1 \le i \le M, 1 \le j \le N, x \in \mathcal{X}^{i,j}} x$. Let $\hat{\pi}^{i,j}_x := \max\{\pi^{i,j}_x, 1 - \pi^{i,j}_x\}$ and $\hat{\pi}_{\max} := \max_{1 \le i \le M, 1 \le j \le N, x \in \mathcal{X}^{i,j}} \hat{\pi}^{i,j}_x$. Let $|\mathcal{X}^{i,j}|$ denote the cardinality of the state space $\mathcal{X}^{i,j}$, and $|\mathcal{X}|_{\max} := \max_{1 \le i \le M, 1 \le j \le N} |\mathcal{X}^{i,j}|$. Let $\rho^{i,j}$ be the eigenvalue gap, $1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of the matrix $(P^{i,j})^2$. Denote $\rho_{\max} := \max_{1 \le i \le M, 1 \le j \le N} \rho^{i,j}$ and $\rho_{\min} := \min_{1 \le i \le M, 1 \le j \le N} \rho^{i,j}$.

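The total reward obtained by time $T$ is $S_T = \sum_{j=1}^{N}\sum_{i=1}^{M}\sum_{s=1}^{n_{i,j}(T)} X_{i,j}(s)$ and the regret is

$$\mathcal{R}_{M,\alpha}(T) := T\sum_{i=1}^{M}\mu_{i,k_i^{**}} - \mathbb{E}_{\alpha}\left[\sum_{j=1}^{N}\sum_{i=1}^{M}\sum_{s=1}^{n_{i,j}(T)} X_{i,j}(s)\right] + C\,\mathbb{E}[m(T)]. \tag{33}$$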



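Define the index

$$g_{i,j}(t) := \bar{X}_{i,j}(t) + \sqrt{\frac{\kappa\log n_i(t)}{n_{i,j}(t)}}, \tag{34}$$

where $\kappa$ is any constant such that $\kappa > (112 + 56M)|\mathcal{X}|^2_{\max}/\rho_{\min}$.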
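To get a feel for the size of the constant $\kappa$, the following sketch evaluates the lower bound $(112 + 56M)|\mathcal{X}|^2_{\max}/\rho_{\min}$ and the corresponding index. It assumes the index has the form sample mean plus $\sqrt{\kappa\log n_i(t)/n_{i,j}(t)}$; the function names and the example numbers are illustrative.

```python
import math


def kappa_lower_bound(num_players, max_states, rho_min):
    """Smallest admissible value (exclusive) of kappa in the index."""
    return (112 + 56 * num_players) * max_states ** 2 / rho_min


def markov_index(sample_mean, n_i, n_ij, kappa):
    """Assumed index form: mean + sqrt(kappa * log(n_i) / n_ij)."""
    return sample_mean + math.sqrt(kappa * math.log(n_i) / n_ij)


# Hypothetical setting: 2 players, 2-state chains, eigenvalue gap 0.96.
bound = kappa_lower_bound(num_players=2, max_states=2, rho_min=0.96)
index = markov_index(0.4, n_i=1000, n_ij=50, kappa=1.01 * bound)
```

The bound grows linearly in the number of players and quadratically in the size of the largest state space, so the exploration term is considerably more conservative than in the i.i.d. case.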
We need the following lemma to prove the regret bound. 

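Lemma 2. If the reward of each player-arm pair is given by a Markov chain satisfying the properties of Lemma 4, then under any policy $\alpha$

$$\mathcal{R}_{M,\alpha}(T) \le \sum_{k \in \mathcal{P}(N)} \Delta_k\,\mathbb{E}[n^k(T)] + C\,\mathbb{E}[m(T)] + K_{\mathcal{X},P}, \tag{35}$$

where $n^k(T)$ is the number of times that the matching $k$ occurred by the time $T$ and $K_{\mathcal{X},P}$ is defined as $K_{\mathcal{X},P} = \sum_{j=1}^{N}\sum_{i=1}^{M}\sum_{x \in \mathcal{X}^{i,j}} x/\pi^{i,j}_{\min}$.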

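Proof: Let $(X_{i,j}(1), X_{i,j}(2), \ldots)$ denote the successive rewards for player $i$ from arm $j$. Let $\mathcal{F}^{i,j}_t$ denote the $\sigma$-algebra generated by $(X_{i,j}(1), \ldots, X_{i,j}(t))$, $\mathcal{F}^{i,j} = \vee_{t \ge 1}\mathcal{F}^{i,j}_t$ and $\mathcal{G}^{i,j} = \vee_{(k,l) \ne (i,j)}\mathcal{F}^{k,l}$. Since arms are independent, $\mathcal{G}^{i,j}$ is independent of $\mathcal{F}^{i,j}$. Clearly, $n_{i,j}(T)$ is a stopping time with respect to $\{\mathcal{F}^{i,j}_t \vee \mathcal{G}^{i,j}, t \ge 1\}$. The total reward is

$$S_T = \sum_{j=1}^{N}\sum_{i=1}^{M}\sum_{t=1}^{n_{i,j}(T)} X_{i,j}(t) = \sum_{j=1}^{N}\sum_{i=1}^{M}\sum_{x \in \mathcal{X}^{i,j}} x\,N(x, n_{i,j}(T)),$$

where $N(x, n_{i,j}(T)) := \sum_{t=1}^{n_{i,j}(T)} I\{X_{i,j}(t) = x\}$. Taking expectations and using Lemma 4,

$$\left|\mathbb{E}[S_T] - \sum_{j=1}^{N}\sum_{i=1}^{M}\sum_{x \in \mathcal{X}^{i,j}} x\,\pi^{i,j}_x\,\mathbb{E}[n_{i,j}(T)]\right| \le \sum_{j=1}^{N}\sum_{i=1}^{M}\sum_{x \in \mathcal{X}^{i,j}} x/\pi^{i,j}_{\min},$$

which implies,

$$\left|\mathbb{E}[S_T] - \sum_{j=1}^{N}\sum_{i=1}^{M}\mu_{i,j}\,\mathbb{E}[n_{i,j}(T)]\right| \le K_{\mathcal{X},P},$$

where $K_{\mathcal{X},P} = \sum_{j=1}^{N}\sum_{i=1}^{M}\sum_{x \in \mathcal{X}^{i,j}} x/\pi^{i,j}_{\min}$. Now,

$$\sum_{j=1}^{N}\sum_{i=1}^{M}\mu_{i,j}\,\mathbb{E}[n_{i,j}(T)] = \sum_{j=1}^{N}\sum_{i=1}^{M}\ \sum_{k \in \mathcal{P}(N), (i,j) \in k}\mu_{i,j}\,\mathbb{E}[n^k(T)] = \sum_{k \in \mathcal{P}(N)}\sum_{i=1}^{M}\mu_{i,k_i}\,\mathbb{E}[n^k(T)] = \sum_{k \in \mathcal{P}(N)}\mu^k\,\mathbb{E}[n^k(T)],$$

where $\mu^k = \sum_{i=1}^{M}\mu_{i,k_i}$. Since regret is defined as in equation (33),

$$\left|\mathcal{R}_{M,\alpha}(T) - \Big(T\mu^{**} - \sum_{k \in \mathcal{P}(N)}\mu^k\,\mathbb{E}[n^k(T)] + C\,\mathbb{E}[m(T)]\Big)\right| \le K_{\mathcal{X},P}. \tag{36}$$

Since $\sum_{k \in \mathcal{P}(N)} n^k(T) = T$, we have $T\mu^{**} - \sum_{k \in \mathcal{P}(N)}\mu^k\,\mathbb{E}[n^k(T)] = \sum_{k \in \mathcal{P}(N)}\Delta_k\,\mathbb{E}[n^k(T)]$, which gives (35). ■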



The main result of this section is the following. 

Theorem 7. (i) Let $\epsilon > 0$ be the precision of the bipartite matching algorithm and the precision of the index representation. If $\Delta_{\min}$, $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are known, choose $\epsilon > 0$ such that $\epsilon < \Delta_{\min}/(M+1)$ and $\kappa > (112 + 56M)|\mathcal{X}|^2_{\max}/\rho_{\min}$. Let $L$ be the length of a frame. Then, the expected regret of the dUCB4 algorithm with index (34) for the decentralized MAB problem with Markovian rewards and per computation cost $C$ is given by

$$\mathcal{R}_{M,\mathrm{dUCB4}}(T) \le \big(L\Delta_{\max} + C(f(L))(1+\log T)\big)\left(\frac{4M^3\kappa N\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2MD+1)MN\right) + K_{\mathcal{X},P}.$$

Thus, $\mathcal{R}_{M,\mathrm{dUCB4}}(T) = O(\log^2 T)$.

(ii) If $\Delta_{\min}$, $|\mathcal{X}|_{\max}$ and $\rho_{\min}$ are unknown, denote $\epsilon_{\min} = \Delta_{\min}/(2(M+1))$ and let $L_t \to \infty$ as $t \to \infty$. Also, choose a positive monotone sequence $\{\kappa_t\}$ such that $\kappa_t \to \infty$ as $t \to \infty$ and $\kappa_t \le t$. Then, $\mathcal{R}_{M,\mathrm{dUCB4}}(T) = O(C(f(L_T))\kappa_T\log^2 T)$. Thus, by choosing arbitrarily slowly increasing sequences, we can make the regret arbitrarily close to $\log^2 T$.



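Proof: (i) We skip the initial steps as they are the same as in the proof of Theorem 6. We start by bounding $\tilde{n}_{i,j}(T)$ as defined in the proof of Theorem 6. Then, from equation (29), we get

$$\begin{aligned}
\tilde{n}_{i,j}(T) \le l + \sum_{m=1}^{\infty}\sum_{p=0}^{\infty} 2^p &\sum_{s_{1,k_1^{**}}=1}^{m+2^p}\cdots\sum_{s_{M,k_M^{**}}=1}^{m+2^p}\ \sum_{s'_{1,k_1}=l}^{m+2^p}\cdots\sum_{s'_{M,k_M}=l}^{m+2^p} I\Big\{\sum_{i=1}^{M}\Big(\bar{X}_{i,k_i^{**}}(m+2^p) + c_{m+2^p,\,s_{i,k_i^{**}}}\Big) \\
&\le (M+1)\epsilon + \sum_{i=1}^{M}\Big(\bar{X}_{i,k_i}(m+2^p) + c_{m+2^p,\,s'_{i,k_i}}\Big)\Big\}.
\end{aligned} \tag{37}$$

Now, the event in the braces $\{\cdot\}$ above implies at least one of the events $(A_i, B_i, C, D)$ given in the display (30). From the proof of Theorem 4 (equations (22), (23)), $P(A_i) \le D\,(m+2^p)^{-\kappa\rho_{\min}/(28|\mathcal{X}|^2_{\max})}$, $P(B_i) \le D\,(m+2^p)^{-\kappa\rho_{\min}/(28|\mathcal{X}|^2_{\max})}$, $1 \le i \le M$. Similar to the steps in display (31), we can show that the event $C$ is false. Also, the event $D$ is false by assumption. So, similar to the proof of Theorem 6 (c.f. display (32)) we get,

$$\begin{aligned}
\mathbb{E}[\tilde{n}_{i,j}(T)] &\le \left\lceil \frac{4M^2\kappa\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil + \sum_{m=1}^{\infty}\sum_{p=0}^{\infty} 2^p \sum_{s_{1,k_1^{**}}=1}^{m+2^p}\cdots\sum_{s_{M,k_M^{**}}=1}^{m+2^p}\ \sum_{s'_{1,k_1}=l}^{m+2^p}\cdots\sum_{s'_{M,k_M}=l}^{m+2^p} 2MD\,(m+2^p)^{-\kappa\rho_{\min}/(28|\mathcal{X}|^2_{\max})} \\
&\le \left\lceil \frac{4M^2\kappa\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} \right\rceil + 2MD\sum_{m=1}^{\infty}\sum_{p=0}^{\infty} 2^p\,(m+2^p)^{-(\kappa\rho_{\min} - 56M|\mathcal{X}|^2_{\max})/(28|\mathcal{X}|^2_{\max})} \\
&\le \frac{4M^2\kappa\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2MD+1),
\end{aligned}$$

when $\kappa > (112 + 56M)|\mathcal{X}|^2_{\max}/\rho_{\min}$. Now, putting it all together, we get

$$\mathbb{E}[\tilde{n}(T)] = \sum_{i=1}^{M}\sum_{j=1}^{N}\mathbb{E}[\tilde{n}_{i,j}(T)] \le \frac{4M^3\kappa N\log T}{(\Delta_{\min} - (M+1)\epsilon)^2} + (2MD+1)MN. \tag{38}$$

Now, by the proof of Theorem 2 (equation (15)), $\mathbb{E}[m(T)] \le \mathbb{E}[\tilde{n}(T)](1+\log T)$. We can now bound the regret.

$$\begin{aligned}
\mathcal{R}_{M,\mathrm{dUCB4}}(T) &\le \sum_{k \in \mathcal{P}(N), k \ne k^{**}} \Delta_k \sum_{i=1}^{M}\mathbb{E}[\tilde{n}_{i,k_i}(T)] + C\,\mathbb{E}[m(T)] + K_{\mathcal{X},P} \\
&\le \Delta_{\max}\sum_{k \in \mathcal{P}(N), k \ne k^{**}}\ \sum_{i=1}^{M}\mathbb{E}[\tilde{n}_{i,k_i}(T)] + C\,\mathbb{E}[m(T)] + K_{\mathcal{X},P} \\
&= \Delta_{\max}\,\mathbb{E}[\tilde{n}(T)] + C\,\mathbb{E}[m(T)] + K_{\mathcal{X},P}.
\end{aligned}$$

For a general $L$, by Theorem 1,

$$\begin{aligned}
\mathcal{R}_{M,\mathrm{dUCB4}}(T) &\le L\Delta_{\max}\,\mathbb{E}[\tilde{n}(T)] + C(f(L))\,\mathbb{E}[m(T)] + K_{\mathcal{X},P} \\
&\le \big(L\Delta_{\max} + C(f(L))(1+\log T)\big)\,\mathbb{E}[\tilde{n}(T)] + K_{\mathcal{X},P}.
\end{aligned}$$

Now, using the bound (38), we get the desired upper bound on the expected regret.

(ii) This can now easily be obtained using the above and following Theorem 6. ■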

VII. Distributed Bipartite Matching: Algorithm and Implementation

In the previous section, we referred to an unspecified distributed algorithm for bipartite matching, $\mathrm{dBM}_{\epsilon}$, that is used by the dUCB4 algorithm. We now present one such algorithm, namely, Bertsekas' auction algorithm [17], and its distributed implementation. We note that the presented algorithm is not the only one that can be used. The dUCB4 algorithm will work with a distributed implementation of any bipartite matching algorithm, e.g. the algorithms given in [24].

Consider a bipartite graph with $M$ players on one side, and $N$ arms on the other, and $M \le N$. Each player $i$ has a value $\mu_{i,j}$ for each arm $j$. Each player knows only his own values. Let us denote by $k^{**}$ a matching that maximizes the matching surplus $\sum_{i,j}\mu_{i,j}x_{i,j}$, where the variable $x_{i,j}$ is 1 if $i$ is matched with $j$, and 0 otherwise. Note that $\sum_i x_{i,j} \le 1$, $\forall j$, and $\sum_j x_{i,j} \le 1$, $\forall i$.

Our goal is to find an $\epsilon$-optimal matching. We call any matching $k^*$ $\epsilon$-optimal if $\sum_i \mu_{i,k^{**}(i)} - \sum_i \mu_{i,k^*(i)} \le \epsilon$.

Algorithm 3: $\mathrm{dBM}_{\epsilon}$ (Bertsekas Auction Algorithm)

All players $i$ initialize prices $p_j = 0$, $\forall$ channels $j$;
while (prices change) do
    Player $i$ communicates his preferred arm $j_i^*$ and bid $b_i = \max_j(\mu_{i,j} - p_j) - 2\max_j(\mu_{i,j} - p_j) + \epsilon/M$ to all other players.
    Each player determines on his own if he is the winner $i_j^*$ on arm $j$;
    All players set prices $p_j = \mu_{i_j^*,j}$;
end while

Here, $2\max_j$ is the second highest maximum over all $j$. The best arm for a player $i$ is arm $j_i^* = \arg\max_j(\mu_{i,j} - p_j)$. The winner $i_j^*$ on an arm $j$ is the one with the highest bid.
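The bidding idea can be sketched in a few lines. The code below is a centralized, sequential illustration, not the distributed protocol: it assumes the textbook forward-auction price update (the price of the chosen arm is raised by the bid increment), which may differ from the price rule stated in Algorithm 3, and it processes one unassigned player at a time. The function name and data layout are hypothetical.

```python
def auction_matching(values, eps):
    """Forward auction for M players and N >= M arms.

    values[i][j] is player i's value for arm j. Each bid raises the price
    of the preferred arm by (best net value - second best net value)
    plus eps / M, so the final matching is within eps of the optimum.
    Returns assigned, where assigned[i] is the arm held by player i.
    """
    num_players, num_arms = len(values), len(values[0])
    prices = [0.0] * num_arms
    owner = [None] * num_arms            # owner[j]: player holding arm j
    assigned = [None] * num_players
    unassigned = list(range(num_players))
    while unassigned:
        i = unassigned.pop()
        net = [values[i][j] - prices[j] for j in range(num_arms)]
        best = max(range(num_arms), key=lambda j: net[j])
        second = max((net[j] for j in range(num_arms) if j != best),
                     default=net[best])
        prices[best] += net[best] - second + eps / num_players
        if owner[best] is not None:      # previous holder is outbid
            assigned[owner[best]] = None
            unassigned.append(owner[best])
        owner[best] = i
        assigned[i] = best
    return assigned


# Two players, two arms: the optimal matching gives each player
# a value of 0.6 (total 1.2) rather than 0.8 + 0.35 = 1.15.
print(auction_matching([[0.8, 0.6], [0.6, 0.35]], eps=0.02))  # [1, 0]
```

Because prices start at zero and only increase, arms that never receive a bid keep price zero, which is what makes the $\epsilon$-optimality argument go through when $M < N$.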
The following lemma in [17] establishes that Bertsekas' auction algorithm will find the $\epsilon$-optimal matching in a finite number of steps.

Lemma 3. [17] Given $\epsilon > 0$, Algorithm 3 with rewards $\mu_{i,j}$, for player $i$ playing the $j$th arm, converges to a matching $k^*$ such that $\sum_i \mu_{i,k^{**}(i)} - \sum_i \mu_{i,k^*(i)} \le \epsilon$, where $k^{**}$ is an optimal matching. Furthermore, this convergence occurs in less than $(M^2\max_{i,j}\{\mu_{i,j}\})/\epsilon$ iterations.

The temporal structure of the dUCB4 algorithm is such that time is divided into frames of 
length L. Each frame is either a decision frame, or an exploitation frame. In the exploitation 
frame, each player plays the arm it was allocated in the last decision frame. The distributed 
bipartite matching algorithm (e.g. based on Algorithm 3) is run in the decision frame. The
decision frame has an interrupt phase of length M and negotiation phase of length L — M. We 
now describe an implementation structure for these phases in the decision frame. 




Fig. 1. Structure of the decision frame 



Interrupt Phase: The interrupt phase can be implemented very easily. It has length M time 
slots. On a pre-determined channel, each player by turn transmits a '1' if the arm with which 
it is now matched has changed, '0' otherwise. If any user transmits a '1', everyone knows that 
the matching has changed, and they reset their counter $\eta = 1$.

Negotiation Phase: The information exchange needed to compute an $\epsilon$-optimal matching is done in the negotiation phase. We first provide a packetized implementation of the negotiation phase. The negotiation phase consists of $J$ subframes of length $M$ each (see Figure 1). In each subframe, the users transmit a packet by turn. The packet contains bid information: (channel number, bid value). Since all users transmit by turn, all the users know the bid values by the end of the subframe, and can compute the new allocation and the prices independently. The number of subframes $J$ determines the precision $\epsilon$ of the distributed bipartite matching algorithm. Note that in the packetized implementation, $\epsilon_1 = 0$, i.e., bid values can be computed exactly, and for a given $\epsilon_2$, we can determine $J$, the number of rounds the dBM algorithm (Algorithm 3) runs for, so that it returns an $\epsilon_2$-optimal matching.
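The relation between precision and frame length can be sketched with a back-of-the-envelope calculation. This is an illustration under two assumptions that go beyond the text: one auction iteration is completed per subframe, and the iteration bound of Lemma 3 is used as a (conservative) choice of $J$. The function name is hypothetical.

```python
import math


def decision_frame_length(num_players, max_value, eps):
    """Return (J, L): subframe count and decision-frame length in slots.

    J is taken as ceil(M^2 * max_value / eps), the iteration bound of
    Lemma 3 (assumption: one auction iteration per subframe). The frame
    consists of an interrupt phase of M slots plus J subframes of M slots.
    """
    rounds = math.ceil(num_players ** 2 * max_value / eps)
    return rounds, num_players * (rounds + 1)


rounds, length = decision_frame_length(num_players=2, max_value=1.0, eps=0.5)
print(rounds, length)  # 8 18
```

Halving $\epsilon$ roughly doubles the frame length under this choice, which illustrates why the cost $C(\epsilon)$ grows without bound as $\epsilon \to 0$.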

If a packetized implementation is not possible, we can give a physical implementation. Our 
only assumption here is going to be that each user can observe a channel, and determine if there 
was a successful transmission on it, a collision, or no transmission, in a given time slot. The whole 
negotiation phase is again divided into J sub-frames. In each sub-frame, each user transmits by 
turn. It simply transmits $\lceil \log M \rceil$ bits to indicate a channel number, and then $\lceil \log 1/\epsilon_1 \rceil$ bits to indicate its bid value to precision $\epsilon_1$. The number of such sub-frames $J$ is again chosen so that the dBM algorithm (based on Algorithm 3) returns an $\epsilon_2$-optimal matching.

VIII. Simulations 

We illustrate the empirical performance of the dUCB4 algorithm when the successive rewards from a channel are i.i.d. and when they are Markovian. Consider two users and two channels. In the i.i.d. case, each channel has rewards that are generated by a Bernoulli distribution taking values 0 and 1. The first user has mean rewards of 0.8 and 0.6 for channels 1 and 2 respectively. The second user has mean rewards of 0.6 and 0.35. The algorithm's performance, averaged over 50 runs, is shown in Figure 2(i). It shows cumulative regret with time. The red bold curve is the theoretical upper bound we derived, while the blue curve is the observed regret. The algorithm seems to perform much better than even the poly-log regret upper bound we derived.
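For this two-user example, the optimal matching and the gap $\Delta_{\min}$ can be found by enumeration. The sketch below uses the mean rewards quoted above; the function name and the zero-based indexing of users and channels are illustrative choices.

```python
from itertools import permutations


def matching_gaps(mu):
    """Enumerate all matchings of M players to N >= M arms.

    mu[i][j] is the mean reward of player i on arm j. Returns the optimal
    matching (a tuple giving each player's arm) and a dict of gaps
    Delta_k for every other matching k.
    """
    num_players, num_arms = len(mu), len(mu[0])
    totals = {k: sum(mu[i][k[i]] for i in range(num_players))
              for k in permutations(range(num_arms), num_players)}
    best = max(totals, key=totals.get)
    gaps = {k: totals[best] - v for k, v in totals.items() if k != best}
    return best, gaps


mu = [[0.8, 0.6], [0.6, 0.35]]
best, gaps = matching_gaps(mu)
print(best, gaps)
```

The optimal matching assigns user 1 to channel 2 and user 2 to channel 1 (total mean reward 1.2), so neither user plays its individually best channel in the way a greedy rule would suggest; the only other matching has total 1.15, giving a gap of about 0.05.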

In the Markovian case, rewards are generated by a Markov chain having states 0 and 1. The mean reward on a channel is given by its stationary distribution, i.e., the probability the Markov chain is in state 1, $\pi_1$. The properties of the Markov chains are given in Table I. The performance of the dUCB4 algorithm on this model, averaged over 50 runs, is shown in Figure 2(ii). Once again, the algorithm seems to perform much better than even the poly-log regret upper bound we derived.
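The stationary probabilities in Table I follow from the transition probabilities of each two-state chain. The sketch below recomputes them, along with the eigenvalue gap of $P^2$ used in the analysis; it relies on the standard facts that a two-state chain with transition probabilities $p_{01}$ and $p_{10}$ has $\pi_1 = p_{01}/(p_{01} + p_{10})$ and second eigenvalue $1 - p_{01} - p_{10}$. The function and variable names are illustrative.

```python
def two_state_stats(p01, p10):
    """Stationary probability of state 1 and eigenvalue gap of P^2.

    For P = [[1 - p01, p01], [p10, 1 - p10]] the eigenvalues are 1 and
    1 - p01 - p10, so P^2 has second eigenvalue (1 - p01 - p10)^2.
    """
    pi1 = p01 / (p01 + p10)
    gap = 1.0 - (1.0 - p01 - p10) ** 2
    return pi1, gap


# (user, channel) -> (p01, p10), as listed in Table I
table = {(1, 1): (0.3, 0.5), (1, 2): (0.2, 0.6),
         (2, 1): (0.6, 0.3), (2, 2): (0.7, 0.2)}
stats = {key: two_state_stats(*probs) for key, probs in table.items()}
for key, (pi1, gap) in stats.items():
    print(key, round(pi1, 4), round(gap, 4))
```

With states 0 and 1, the mean reward of each chain equals $\pi_1$, so these values play the role of $\mu_{i,j}$ when determining the optimal matching for the Markovian experiment.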

IX. Conclusions 

We have proposed a dUCB4 algorithm for decentralized learning in multi-armed bandit problems that achieves a regret of near-$O(\log^2 T)$. Finding a lower bound is usually quite difficult, and is currently a work in progress.

TABLE I
Markov Chain Parameters: Transition Probability and Stationary Distribution

User    Channel    p_01, p_10    π
1       1          0.3, 0.5      0.3/0.8
1       2          0.2, 0.6      0.2/0.8
2       1          0.6, 0.3      0.6/0.9
2       2          0.7, 0.2      0.7/0.9
Appendix

Let $(X_t, t = 1, 2, \ldots)$ be an irreducible, aperiodic and reversible Markov chain on a finite state space $\mathcal{X}$ with transition probability matrix $P$, a stationary distribution $\pi$ and an initial distribution $\lambda$. Let $\mathcal{F}_t$ be the $\sigma$-algebra generated by $(X_1, X_2, \ldots, X_t)$.

Lemma 4. Let $\mathcal{G}$ be a $\sigma$-algebra independent of $\mathcal{F} = \vee_{t \ge 1}\mathcal{F}_t$. Let $\tau$ be a stopping time of $\mathcal{F}_t \vee \mathcal{G}$. Let $N(x, \tau) := \sum_{t=1}^{\tau} I\{X_t = x\}$. Then, $|\mathbb{E}[N(x, \tau)] - \pi_x\mathbb{E}[\tau]| \le K$, where $K \le 1/\pi_{\min}$ and $\pi_{\min} = \min_{x \in \mathcal{X}}\pi_x$. $K$ depends on $P$.

Lemma 5. [27] Denote $N_{\lambda} = \|(\lambda_x/\pi_x, x \in \mathcal{X})\|_2$. Let $\rho$ be the eigenvalue gap, $1 - \lambda_2$, where $\lambda_2$ is the second largest eigenvalue of the matrix $P^2$. Let $f : \mathcal{X} \to \mathbb{R}$ be such that $\sum_{x \in \mathcal{X}}\pi_x f(x) = 0$, $\|f\|_{\infty} \le 1$, $\|f\|_2^2 \le 1$. Then, for any $\gamma > 0$, $P\big(\sum_{a=1}^{t} f(X_a)/t \ge \gamma\big) \le N_{\lambda}e^{-t\gamma^2\rho/28}$.